Hi Nardus,

There is one more thing to be checked.

1) Could you check whether any packets are sent from the affected host to the engine?

on host:
# outgoing traffic
sudo tcpdump -i <interface_name_on_host> -c 1000 -ttttnnvvS dst <engine_host>

2) Same the other way round: check whether packets from the affected host are received on the engine side.

on engine:
# incoming traffic
sudo tcpdump -i <interface_name_on_engine> -c 1000 -ttttnnvvS src <affected_host>

Artur

On Thu, Aug 6, 2020 at 4:51 PM Artur Socha <aso...@redhat.com> wrote:

> Thanks Nardus,
> After a quick look I found what I was suspecting: there are way too many
> threads in Blocked state. I don't know the reason yet, but this is very
> helpful. I'll let you know about the findings/investigation. Meanwhile, you
> may try restarting the engine as a (very brute and ugly) workaround.
> You may also try to set up a slightly bigger thread pool; it may save you
> some time until the next hiccup. However, please be aware that this may come
> at the cost of higher memory and CPU usage (due to increased context
> switching).
> Here are some docs:
>
> # Specify the thread pool size for jboss managed scheduled executor service
> # used by commands to periodically execute methods. It is generally not
> # necessary to increase the number of threads in this thread pool. To change
> # the value permanently create a conf file
> # 99-engine-scheduled-thread-pool.conf in /etc/ovirt-engine/engine.conf.d/
> ENGINE_SCHEDULED_THREAD_POOL_SIZE=100
>
> A.
>
> On Thu, Aug 6, 2020 at 4:19 PM Nardus Geldenhuys <nard...@gmail.com> wrote:
>
>> Hi Artur
>>
>> Please find attached, also let me know if I need to rerun. They are 5 min
>> apart.
>>
>> [root@engine-aa-1-01 ovirt-engine]# ps -ef | grep jboss | grep -v grep | awk '{ print $2 }'
>> 27390
>> [root@engine-aa-1-01 ovirt-engine]# jstack -F 27390 > your_engine_thread_dump_1.txt
>> [root@engine-aa-1-01 ovirt-engine]# jstack -F 27390 > your_engine_thread_dump_2.txt
>> [root@engine-aa-1-01 ovirt-engine]# jstack -F 27390 > your_engine_thread_dump_3.txt
>>
>> Regards
>>
>> Nar
>>
>> On Thu, 6 Aug 2020 at 15:55, Artur Socha <aso...@redhat.com> wrote:
>>
>>> Sure thing.
>>> On engine host please find jboss pid.
>>> You can use this command:
>>>
>>> ps -ef | grep jboss | grep -v grep | awk '{ print $2 }'
>>>
>>> or the jps tool from the JDK. Sample output on my dev environment is:
>>>
>>> ± % jps
>>> !2860
>>> 64853 jboss-modules.jar
>>> 196217 Jps
>>>
>>> Then use jstack from the JDK:
>>> jstack <pid> > your_engine_thread_dump.txt
>>> 2 or 3 dumps taken at approximately 5-minute intervals would be even
>>> more useful.
>>>
>>> Here you can find even more options:
>>> https://www.baeldung.com/java-thread-dump
>>>
>>> Artur
>>>
>>> On Thu, Aug 6, 2020 at 3:15 PM Nardus Geldenhuys <nard...@gmail.com> wrote:
>>>
>>>> Hi
>>>>
>>>> I can create a thread dump, please send details on how to.
>>>>
>>>> Regards
>>>>
>>>> Nardus
>>>>
>>>> On Thu, 6 Aug 2020 at 14:17, Artur Socha <aso...@redhat.com> wrote:
>>>>
>>>>> Hi Nardus,
>>>>> You might have hit an issue I have been hunting for some time ([1]
>>>>> and [2]).
>>>>> [1] could not be properly resolved because at the time I was not able to
>>>>> recreate the issue on a dev setup.
>>>>> I suspect [2] is related.
>>>>>
>>>>> Would you be able to prepare a thread dump from your engine instance?
>>>>> Additionally, please check for potential libvirt errors/warnings.
>>>>> Can you also paste the output of:
>>>>> sudo yum list installed | grep vdsm
>>>>> sudo yum list installed | grep ovirt-engine
>>>>> sudo yum list installed | grep libvirt
>>>>>
>>>>> Usually, according to previous reports, restarting the engine helps to
>>>>> restore connectivity with hosts ... at least for some time.
>>>>>
>>>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1845152
>>>>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1846338
>>>>>
>>>>> regards,
>>>>> Artur
>>>>>
>>>>> On Thu, Aug 6, 2020 at 8:01 AM Nardus Geldenhuys <nard...@gmail.com> wrote:
>>>>>
>>>>>> Also see this in engine:
>>>>>>
>>>>>> Aug 6, 2020, 7:37:17 AM
>>>>>> VDSM someserver command Get Host Capabilities failed: Message timeout
>>>>>> which can be caused by communication issues
>>>>>>
>>>>>> On Thu, 6 Aug 2020 at 07:09, Strahil Nikolov <hunter86...@yahoo.com> wrote:
>>>>>>
>>>>>>> Can you check for errors on the affected host? Most probably you
>>>>>>> need the vdsm logs.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Strahil Nikolov
>>>>>>>
>>>>>>> On 6 August 2020 at 7:40:23 GMT+03:00, Nardus Geldenhuys <nard...@gmail.com> wrote:
>>>>>>> >Hi Strahil
>>>>>>> >
>>>>>>> >Hope you are well. I get the following error when I try to confirm
>>>>>>> >the reboot:
>>>>>>> >
>>>>>>> >Error while executing action: Cannot confirm 'Host has been rebooted'
>>>>>>> >Host.
>>>>>>> >Valid Host statuses are "Non operational", "Maintenance" or
>>>>>>> >"Connecting".
>>>>>>> >
>>>>>>> >And I can't put it in maintenance, the only options are "restart" or
>>>>>>> >"stop".
>>>>>>> >
>>>>>>> >Regards
>>>>>>> >
>>>>>>> >Nar
>>>>>>> >
>>>>>>> >On Thu, 6 Aug 2020 at 06:16, Strahil Nikolov <hunter86...@yahoo.com> wrote:
>>>>>>> >
>>>>>>> >> After rebooting the node, have you "marked" it as rebooted?
>>>>>>> >>
>>>>>>> >> Best Regards,
>>>>>>> >> Strahil Nikolov
>>>>>>> >>
>>>>>>> >> On 5 August 2020 at 21:29:04 GMT+03:00, Nardus Geldenhuys <nard...@gmail.com> wrote:
>>>>>>> >> >Hi oVirt land
>>>>>>> >> >
>>>>>>> >> >Hope you are well. Got a bit of an issue, actually a big issue. We
>>>>>>> >> >had some sort of dip.
>>>>>>> >> >All the VMs are still running, but some of the hosts show
>>>>>>> >> >"Unassigned" or "NonResponsive". All the hosts were showing UP and
>>>>>>> >> >were fine before our dip. I increased vdsHeartbeatInSecond to 240,
>>>>>>> >> >no luck.
>>>>>>> >> >
>>>>>>> >> >I still get a timeout on the engine lock even though I can connect
>>>>>>> >> >to that host from the engine, using nc to test port 54321. I also
>>>>>>> >> >restarted vdsmd and rebooted the host, with no luck.
>>>>>>> >> >
>>>>>>> >> > nc -v someserver 54321
>>>>>>> >> >Ncat: Version 7.50 ( https://nmap.org/ncat )
>>>>>>> >> >Ncat: Connected to 172.40.2.172:54321.
>>>>>>> >> >
>>>>>>> >> >2020-08-05 20:20:34,256+02 ERROR
>>>>>>> >> >[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>>>>> >> >(EE-ManagedThreadFactory-engineScheduled-Thread-70) [] EVENT_ID:
>>>>>>> >> >VDS_BROKER_COMMAND_FAILURE(10,802), VDSM someserver command Get Host
>>>>>>> >> >Capabilities failed: Message timeout which can be caused by
>>>>>>> >> >communication issues
>>>>>>> >> >
>>>>>>> >> >Any troubleshooting ideas will be gladly appreciated.
>>>>>>> >> >
>>>>>>> >> >Regards
>>>>>>> >> >
>>>>>>> >> >Nar
>>>>>>> >>
>>>>>>>
>>>>>> _______________________________________________
>>>>>> Users mailing list -- users@ovirt.org
>>>>>> To unsubscribe send an email to users-le...@ovirt.org
>>>>>> Privacy Statement: https://www.ovirt.org/privacy-policy.html
>>>>>> oVirt Code of Conduct:
>>>>>> https://www.ovirt.org/community/about/community-guidelines/
>>>>>> List Archives:
>>>>>> https://lists.ovirt.org/archives/list/users@ovirt.org/message/C4HB2J3MH76FI2325Z4AV4VCCEKH4M3S/
>>>>>>
>>>>>
>>>>> --
>>>>> Artur Socha
>>>>> Senior Software Engineer, RHV
>>>>> Red Hat
>>>>>
>>>>
>>>
>>> --
>>> Artur Socha
>>> Senior Software Engineer, RHV
>>> Red Hat
>>>
>>
>
> --
> Artur Socha
> Senior Software Engineer, RHV
> Red Hat
>

--
Artur Socha
Senior Software Engineer, RHV
Red Hat
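
Since the thread turns on how many engine threads are in Blocked state, here is a minimal sketch of how a jstack dump could be summarised by thread state. The function name count_thread_states is hypothetical, and the sketch assumes the dump marks states either as "(state = BLOCKED)", as jstack -F usually prints, or as "java.lang.Thread.State: BLOCKED", as plain jstack usually prints. It is only a rough tally, not a substitute for reading the stack traces.

```shell
#!/bin/sh
# Tally thread states in a jstack thread dump read from stdin.
# Assumption: each thread's state appears either as "(state = BLOCKED)"
# (jstack -F style) or as "java.lang.Thread.State: BLOCKED" (plain jstack).
count_thread_states() {
    # Extract only the state markers, one per line.
    grep -oE '(state = [A-Z_]+|Thread\.State: [A-Z_]+)' \
        | sed -E 's/.*(= |: )//' \
        | sort \
        | uniq -c \
        | sort -rn
    # Output: one line per state, "<count> <STATE>", most frequent first.
}
```

It could be run against each of the three dumps in turn, for example: count_thread_states < your_engine_thread_dump_1.txt. A BLOCKED count that stays high across dumps taken 5 minutes apart would be consistent with the observation quoted above.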
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/VDRNQQB27R2K5EUIWV7FP4W36B2P2YP5/