Hi Nardus,
There is one more thing to be checked.

1) Could you check whether any packets are sent from the affected host to
the engine?
On the host:
# outgoing traffic
sudo tcpdump -i <interface_name_on_host> -c 1000 -ttttnnvvS dst <engine_host>

2) The same the other way round: check whether packets from the affected
host are received on the engine side.
On the engine:
# incoming traffic
sudo tcpdump -i <interface_name_on_engine> -c 1000 -ttttnnvvS src <affected_host>
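
The two checks above differ only in the filter direction, so they can be
wrapped in one small helper. This is only a sketch: capture_cmd is a
hypothetical name (not part of tcpdump or oVirt), and it writes the filter as
"dst host" / "src host", which is the explicit form of the "dst" / "src"
filters above. It prints the command instead of running it, so you can review
it first.

```shell
#!/bin/sh
# Hypothetical helper: print the tcpdump command for one capture direction.
#   out = run on the affected host, peer is the engine
#   in  = run on the engine, peer is the affected host
capture_cmd() {
    direction=$1
    iface=$2
    peer=$3
    case "$direction" in
        out) filter="dst host $peer" ;;
        in)  filter="src host $peer" ;;
        *)   echo "usage: capture_cmd out|in <interface> <peer>" >&2
             return 1 ;;
    esac
    # Same options as above: stop after 1000 packets, full timestamps,
    # no name resolution, verbose output, absolute TCP sequence numbers.
    echo "tcpdump -i $iface -c 1000 -ttttnnvvS $filter"
}
```

For example, on the host: sudo $(capture_cmd out <interface_name_on_host> <engine_host>)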

Artur


On Thu, Aug 6, 2020 at 4:51 PM Artur Socha <aso...@redhat.com> wrote:

> Thanks Nardus,
> After a quick look I found what I was suspecting: there are way too many
> threads in Blocked state. I don't know the reason yet, but this is very
> helpful. I'll let you know about the findings/investigation. Meanwhile, you
> may try restarting the engine as a (very crude and ugly) workaround.
> You may also try to set up a slightly bigger thread pool, which may buy you
> some time until the next hiccup. However, please be aware that this may come
> at the cost of higher memory usage and higher CPU usage (due to increased
> context switching).
> Here are some docs:
>
> # Specify the thread pool size for jboss managed scheduled executor service
> # used by commands to periodically execute methods. It is generally not
> # necessary to increase the number of threads in this thread pool. To change
> # the value permanently create a conf file
> # 99-engine-scheduled-thread-pool.conf in /etc/ovirt-engine/engine.conf.d/
> ENGINE_SCHEDULED_THREAD_POOL_SIZE=100
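
A minimal sketch of creating that conf file, assuming it is run as root on the
engine machine. write_pool_conf is a hypothetical helper name, and the optional
second argument (an alternative target directory) exists only so the function
can be tried without touching /etc.

```shell
#!/bin/sh
# Hypothetical helper: write the scheduled thread pool override file.
#   $1 = pool size (positive integer)
#   $2 = conf directory (defaults to the path quoted in the docs above)
write_pool_conf() {
    size=$1
    conf_dir=${2:-/etc/ovirt-engine/engine.conf.d}
    # Reject empty or non-numeric sizes instead of writing a broken file.
    case "$size" in
        ''|*[!0-9]*) echo "size must be a positive integer" >&2; return 1 ;;
    esac
    printf 'ENGINE_SCHEDULED_THREAD_POOL_SIZE=%s\n' "$size" \
        > "$conf_dir/99-engine-scheduled-thread-pool.conf"
}
```

For example: write_pool_conf 150 && systemctl restart ovirt-engine
The value 150 is only an illustration, not a recommendation. The restart is
there because the engine has to be restarted to pick up the new value.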
>
>
> A.
>
>
> On Thu, Aug 6, 2020 at 4:19 PM Nardus Geldenhuys <nard...@gmail.com>
> wrote:
>
>> Hi Artur
>>
>> Please find attached; also let me know if I need to rerun. They are 5 min
>> apart.
>>
>> [root@engine-aa-1-01 ovirt-engine]# ps -ef | grep jboss | grep -v grep | awk '{ print $2 }'
>> 27390
>> [root@engine-aa-1-01 ovirt-engine]# jstack -F 27390 > your_engine_thread_dump_1.txt
>> [root@engine-aa-1-01 ovirt-engine]# jstack -F 27390 > your_engine_thread_dump_2.txt
>> [root@engine-aa-1-01 ovirt-engine]# jstack -F 27390 > your_engine_thread_dump_3.txt
>>
>> Regards
>>
>> Nar
>>
>> On Thu, 6 Aug 2020 at 15:55, Artur Socha <aso...@redhat.com> wrote:
>>
>>> Sure thing.
>>> On the engine host, please find the jboss pid. You can use this command:
>>>
>>>  ps -ef | grep jboss | grep -v grep | awk '{ print $2 }'
>>>
>>> or the jps tool from the JDK. Sample output on my dev environment is:
>>>
>>> % jps
>>> 64853 jboss-modules.jar
>>> 196217 Jps
>>>
>>> Then use jstack from the JDK:
>>> jstack <pid> > your_engine_thread_dump.txt
>>> 2 or 3 dumps taken at approximately 5-minute intervals would be even
>>> more useful.
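
Once the dumps exist, a quick way to see how many threads are in each state is
to count the "java.lang.Thread.State:" lines. This is only a sketch:
thread_states is a hypothetical helper name, and it assumes the regular jstack
output format. Dumps taken with "jstack -F" report thread state in a different
layout and would need a different pattern.

```shell
#!/bin/sh
# Hypothetical helper: summarize thread states across one or more dumps.
# Prints "<count> <STATE>" lines, most frequent state first.
thread_states() {
    # -h suppresses file names so several dumps can be summed together.
    grep -h 'java.lang.Thread.State:' "$@" |
        awk '{ count[$2]++ } END { for (s in count) print count[s], s }' |
        sort -rn
}
```

For example: thread_states your_engine_thread_dump.txt
A BLOCKED count that stays high across dumps taken minutes apart is the
pattern worth reporting.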
>>>
>>> Here you can find even more options
>>> https://www.baeldung.com/java-thread-dump
>>>
>>> Artur
>>>
>>> On Thu, Aug 6, 2020 at 3:15 PM Nardus Geldenhuys <nard...@gmail.com>
>>> wrote:
>>>
>>>> Hi
>>>>
>>>> I can create a thread dump; please send details on how to do it.
>>>>
>>>> Regards
>>>>
>>>> Nardus
>>>>
>>>> On Thu, 6 Aug 2020 at 14:17, Artur Socha <aso...@redhat.com> wrote:
>>>>
>>>>> Hi Nardus,
>>>>> You might have hit an issue I have been hunting for some time ([1]
>>>>> and [2]).
>>>>> [1] could not be properly resolved because at the time I was not able to
>>>>> recreate the issue on a dev setup.
>>>>> I suspect [2] is related.
>>>>>
>>>>> Would you be able to prepare a thread dump from your engine instance?
>>>>> Additionally, please check for potential libvirt errors/warnings.
>>>>> Can you also paste the output of:
>>>>> sudo yum list installed | grep vdsm
>>>>> sudo yum list installed | grep ovirt-engine
>>>>> sudo yum list installed | grep libvirt
>>>>>
>>>>> Usually, according to previous reports, restarting the engine helps to
>>>>> restore connectivity with hosts ... at least for some time.
>>>>>
>>>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1845152
>>>>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1846338
>>>>>
>>>>> regards,
>>>>> Artur
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Aug 6, 2020 at 8:01 AM Nardus Geldenhuys <nard...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Also see this in engine:
>>>>>>
>>>>>> Aug 6, 2020, 7:37:17 AM
>>>>>> VDSM someserver command Get Host Capabilities failed: Message timeout
>>>>>> which can be caused by communication issues
>>>>>>
>>>>>> On Thu, 6 Aug 2020 at 07:09, Strahil Nikolov <hunter86...@yahoo.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Can you check for errors on the affected host? Most probably you
>>>>>>> need the vdsm logs.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Strahil Nikolov
>>>>>>>
>>>>>>> On 6 August 2020 at 7:40:23 GMT+03:00, Nardus Geldenhuys <
>>>>>>> nard...@gmail.com> wrote:
>>>>>>> >Hi Strahil
>>>>>>> >
>>>>>>> >Hope you are well. I get the following error when I try to confirm
>>>>>>> >the reboot:
>>>>>>> >
>>>>>>> >Error while executing action: Cannot confirm 'Host has been rebooted'
>>>>>>> >Host.
>>>>>>> >Valid Host statuses are "Non operational", "Maintenance" or
>>>>>>> >"Connecting".
>>>>>>> >
>>>>>>> >And I can't put it in maintenance; the only options are "restart" or
>>>>>>> >"stop".
>>>>>>> >
>>>>>>> >Regards
>>>>>>> >
>>>>>>> >Nar
>>>>>>> >
>>>>>>> >On Thu, 6 Aug 2020 at 06:16, Strahil Nikolov <hunter86...@yahoo.com>
>>>>>>> >wrote:
>>>>>>> >
>>>>>>> >> After rebooting the node, have you "marked" it as rebooted?
>>>>>>> >>
>>>>>>> >> Best Regards,
>>>>>>> >> Strahil Nikolov
>>>>>>> >>
>>>>>>> >> On 5 August 2020 at 21:29:04 GMT+03:00, Nardus Geldenhuys <
>>>>>>> >> nard...@gmail.com> wrote:
>>>>>>> >> >Hi oVirt land
>>>>>>> >> >
>>>>>>> >> >Hope you are well. Got a bit of an issue, actually a big issue. We had
>>>>>>> >> >some sort of dip. All the VMs are still running, but some of the hosts
>>>>>>> >> >show "Unassigned" or "NonResponsive". All the hosts were showing UP
>>>>>>> >> >and were fine before our dip. I increased vdsHeartbeatInSecond to 240,
>>>>>>> >> >with no luck.
>>>>>>> >> >
>>>>>>> >> >I still get a timeout on the engine lock even though I can connect to
>>>>>>> >> >that host from the engine, using nc to test port 54321. I also
>>>>>>> >> >restarted vdsmd and rebooted the host, with no luck.
>>>>>>> >> >
>>>>>>> >> > nc -v someserver 54321
>>>>>>> >> >Ncat: Version 7.50 ( https://nmap.org/ncat )
>>>>>>> >> >Ncat: Connected to 172.40.2.172:54321.
>>>>>>> >> >
>>>>>>> >> >2020-08-05 20:20:34,256+02 ERROR
>>>>>>> >> >[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>>>>> >> >(EE-ManagedThreadFactory-engineScheduled-Thread-70) [] EVENT_ID:
>>>>>>> >> >VDS_BROKER_COMMAND_FAILURE(10,802), VDSM someserver command Get Host
>>>>>>> >> >Capabilities failed: Message timeout which can be caused by
>>>>>>> >> >communication issues
>>>>>>> >> >
>>>>>>> >> >Any troubleshooting ideas would be greatly appreciated.
>>>>>>> >> >
>>>>>>> >> >Regards
>>>>>>> >> >
>>>>>>> >> >Nar
>>>>>>> >>
>>>>>>>
>>>>>> _______________________________________________
>>>>>> Users mailing list -- users@ovirt.org
>>>>>> To unsubscribe send an email to users-le...@ovirt.org
>>>>>> Privacy Statement: https://www.ovirt.org/privacy-policy.html
>>>>>> oVirt Code of Conduct:
>>>>>> https://www.ovirt.org/community/about/community-guidelines/
>>>>>> List Archives:
>>>>>> https://lists.ovirt.org/archives/list/users@ovirt.org/message/C4HB2J3MH76FI2325Z4AV4VCCEKH4M3S/
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Artur Socha
>>>>> Senior Software Engineer, RHV
>>>>> Red Hat
>>>>>
>>>>
>>>
>>> --
>>> Artur Socha
>>> Senior Software Engineer, RHV
>>> Red Hat
>>>
>>
>
> --
> Artur Socha
> Senior Software Engineer, RHV
> Red Hat
>


-- 
Artur Socha
Senior Software Engineer, RHV
Red Hat
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/VDRNQQB27R2K5EUIWV7FP4W36B2P2YP5/
