[ovirt-users] Re: ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 1600 due to network status

2021-07-25 Thread Yedidyah Bar David
On Fri, Jul 23, 2021 at 6:17 PM Christoph Timm  wrote:
>
>
>
> On 21.07.21 at 12:33, Christoph Timm wrote:
> >
> > On 21.07.21 at 12:17, Yedidyah Bar David wrote:
> >> On Mon, Jul 19, 2021 at 2:20 PM Yedidyah Bar David 
> >> wrote:
> >>> On Mon, Jul 19, 2021 at 1:54 PM Christoph Timm  wrote:
> 
> 
>  On 19.07.21 at 10:52, Yedidyah Bar David wrote:
> > On Mon, Jul 19, 2021 at 11:39 AM Christoph Timm 
> > wrote:
> >> On 19.07.21 at 10:25, Yedidyah Bar David wrote:
> >>> On Mon, Jul 19, 2021 at 11:02 AM Christoph Timm
> >>>  wrote:
>  On 19.07.21 at 09:27, Yedidyah Bar David wrote:
> > On Mon, Jul 19, 2021 at 10:04 AM Christoph Timm
> >  wrote:
> >> Hi Didi,
> >>
> >> thank you for the quick response.
> >>
> >>
> >> On 19.07.21 at 07:59, Yedidyah Bar David wrote:
> >>> On Mon, Jul 19, 2021 at 8:39 AM Christoph Timm
> >>>  wrote:
>  Hi List,
> 
>  I'm trying to understand why my hosted engine is moved from
>  one node to another from time to time.
>  It sometimes happens multiple times a day, but there are also
>  days without it.
> 
>  I can see the following in the
>  ovirt-hosted-engine-ha/agent.log:
>  ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
>  Penalizing score by 1600 due to network status
> 
>  After that the engine is shut down and started on another host.
>  The oVirt Admin portal shows the following around the same time:
>  Invalid status on Data Center Default. Setting status to
>  Non Responsive.
> 
>  But the whole cluster is working normally during that time.
> 
>  I believe that I somehow have a network issue on my side, but I
>  have no clue what kind of check is causing the network status to
>  be penalized.
> 
>  Does anyone have an idea how to investigate this further?
> >>> Please check also broker.log. Do you see 'dig' failures?
> >> Yes I found them as well.
> >>
> >> Thread-1::WARNING::2021-07-19
> >> 08:02:00,032::network::120::network.Network::(_dns) DNS query
> >> failed:
> >> ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
> >> ;; global options: +cmd
> >> ;; connection timed out; no servers could be reached
> >>
> >>> This happened several times already on our CI
> >>> infrastructure, but yours is
> >>> the first report from an actual real user. See also:
> >>>
> >>> https://lists.ovirt.org/archives/list/in...@ovirt.org/thread/LIGS5WXGEKWACY5GCK7Z6Q2JYVWJ6JBF/
> >>>
> >> So I understand that the following command is triggered to
> >> test the
> >> network: "dig +tries=1 +time=5"
> > Indeed.
> >
> >>> I didn't open a bug for this (yet?), also because I never
> >>> reproduced on my
> >>> own machines and am not sure about the exact failing flow.
> >>> If this is
> >>> reproducible
> >>> reliably for you, you might want to test the patch I pushed:
> >>>
> >>> https://gerrit.ovirt.org/c/ovirt-hosted-engine-ha/+/115596
> >> Now filed this bug and linked to it in the above patch. Thanks for
> >> your report!
> >>
> >> https://bugzilla.redhat.com/show_bug.cgi?id=1984356
> > Perfect, I added myself to CC as well.
> >
> > I have implemented the change on one of my nodes, restarted the
> > ovirt-ha-broker and moved the engine to that node.
> > Since then the issue has not occurred. I guess I will leave it running
> > until the end of the week and then move the engine back to an unchanged
> > node to see whether the issue comes back.
> So I had no issue with the changed host until now. This morning I moved
> the engine to a different host, and now the issue has occurred again. So I
> will implement the fix on all my hosts now.
> I hope this fix will be permanently included in the next release.

Yes, the bug is targeted to 4.4.8 and the patch is merged.

Best regards,

> >>
> >> Best regards,
> >>
> >> I'm happy to give it a try.
> >> Please confirm that I need to replace this file (network.py)
> >> on all my
> >> nodes (CentOS 8.4 based) which can host my engine.
> > It definitely makes sense to do so, but in principle there is
> > no problem
> > with applying it only on some of them. That's especially
> > useful if you try
> > this first on a test env and try to enforce a reproduction
> > somehow (overload
> > the network, disconnect stuff, etc.).
>  OK

[ovirt-users] Re: ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 1600 due to network status

2021-07-23 Thread Christoph Timm



On 21.07.21 at 12:33, Christoph Timm wrote:


On 21.07.21 at 12:17, Yedidyah Bar David wrote:
On Mon, Jul 19, 2021 at 2:20 PM Yedidyah Bar David  
wrote:

On Mon, Jul 19, 2021 at 1:54 PM Christoph Timm  wrote:



On 19.07.21 at 10:52, Yedidyah Bar David wrote:
On Mon, Jul 19, 2021 at 11:39 AM Christoph Timm  
wrote:

On 19.07.21 at 10:25, Yedidyah Bar David wrote:
On Mon, Jul 19, 2021 at 11:02 AM Christoph Timm 
 wrote:

On 19.07.21 at 09:27, Yedidyah Bar David wrote:
On Mon, Jul 19, 2021 at 10:04 AM Christoph Timm 
 wrote:

Hi Didi,

thank you for the quick response.


On 19.07.21 at 07:59, Yedidyah Bar David wrote:
On Mon, Jul 19, 2021 at 8:39 AM Christoph Timm 
 wrote:

Hi List,

I'm trying to understand why my hosted engine is moved from one node to
another from time to time.
It sometimes happens multiple times a day, but there are also days
without it.

I can see the following in the ovirt-hosted-engine-ha/agent.log:
ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
Penalizing score by 1600 due to network status

After that the engine is shut down and started on another host.
The oVirt Admin portal shows the following around the same time:
Invalid status on Data Center Default. Setting status to Non Responsive.

But the whole cluster is working normally during that time.

I believe that I somehow have a network issue on my side, but I have no
clue what kind of check is causing the network status to be penalized.


Does anyone have an idea how to investigate this further?

Please check also broker.log. Do you see 'dig' failures?

Yes I found them as well.

Thread-1::WARNING::2021-07-19
08:02:00,032::network::120::network.Network::(_dns) DNS query 
failed:

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
;; global options: +cmd
;; connection timed out; no servers could be reached

This has already happened several times on our CI infrastructure, but
yours is the first report from an actual real user. See also:

https://lists.ovirt.org/archives/list/in...@ovirt.org/thread/LIGS5WXGEKWACY5GCK7Z6Q2JYVWJ6JBF/

So I understand that the following command is triggered to test the
network: "dig +tries=1 +time=5"

Indeed.

I didn't open a bug for this (yet?), also because I never reproduced it
on my own machines and am not sure about the exact failing flow. If this
is reliably reproducible for you, you might want to test the patch I
pushed:

https://gerrit.ovirt.org/c/ovirt-hosted-engine-ha/+/115596

I have now filed this bug and linked to it in the above patch. Thanks
for your report!


https://bugzilla.redhat.com/show_bug.cgi?id=1984356

Perfect, I added myself to CC as well.

I have implemented the change on one of my nodes, restarted the
ovirt-ha-broker and moved the engine to that node.
Since then the issue has not occurred. I guess I will leave it running
until the end of the week and then move the engine back to an unchanged
node to see whether the issue comes back.

So I had no issue with the changed host until now. This morning I moved
the engine to a different host, and now the issue has occurred again. So I
will implement the fix on all my hosts now.

I hope this fix will be permanently included in the next release.


Best regards,


I'm happy to give it a try.
Please confirm that I need to replace this file (network.py) on all my
nodes (CentOS 8.4 based) which can host my engine.

It definitely makes sense to do so, but in principle there is no
problem with applying it only on some of them. That's especially
useful if you try this first on a test env and try to enforce a
reproduction somehow (overload the network, disconnect stuff, etc.).

OK will give it a try and report back.

Thanks and good luck.

Do I need to restart anything after that change?

Yes, the broker. This might restart some other services there, so
it is best to put the host into maintenance while doing this.

Also, please confirm that the comma after TCP is correct, as there was
none after the timeout in row 110 before.

It is correct, but not mandatory. We (my team, at least) often add it
in such cases so that a theoretical future patch that adds another
parameter does not need to add it again (thus making the patch smaller
and hopefully cleaner).

Other ideas/opinions about how to enhance this part of the
monitoring are most welcome.

If this phenomenon is new for you, and you can reliably say
it's not due to a recent "natural" higher network load, I wonder if
it's due to some weird bug/change somewhere.

I'm quite sure that I have seen this since we moved to 4.4.(4).
Just for housekeeping: I'm running 4.4.7 now.

We have used 'dig' as the network monitor since 4.3.5, around one
year before 4.4 was released: https://bugzilla.redhat.com/1659052

Which version did you use before 4.4?

The last 4.3 versions we ran were 4.3.7, 4.3.9 and 4.3.10, before
migrating to 4.4.4.
I now realize that in above-linked bug we only changed the 
default, for new
setups. So if you deployed He before 4.3.5, upgrade to later 4.3 

[ovirt-users] Re: ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 1600 due to network status

2021-07-21 Thread Christoph Timm


On 21.07.21 at 12:17, Yedidyah Bar David wrote:

On Mon, Jul 19, 2021 at 2:20 PM Yedidyah Bar David  wrote:

On Mon, Jul 19, 2021 at 1:54 PM Christoph Timm  wrote:



On 19.07.21 at 10:52, Yedidyah Bar David wrote:

On Mon, Jul 19, 2021 at 11:39 AM Christoph Timm  wrote:

On 19.07.21 at 10:25, Yedidyah Bar David wrote:

On Mon, Jul 19, 2021 at 11:02 AM Christoph Timm  wrote:

On 19.07.21 at 09:27, Yedidyah Bar David wrote:

On Mon, Jul 19, 2021 at 10:04 AM Christoph Timm  wrote:

Hi Didi,

thank you for the quick response.


On 19.07.21 at 07:59, Yedidyah Bar David wrote:

On Mon, Jul 19, 2021 at 8:39 AM Christoph Timm  wrote:

Hi List,

I'm trying to understand why my hosted engine is moved from one node to
another from time to time.
It sometimes happens multiple times a day, but there are also days
without it.

I can see the following in the ovirt-hosted-engine-ha/agent.log:
ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
Penalizing score by 1600 due to network status

After that the engine is shut down and started on another host.
The oVirt Admin portal shows the following around the same time:
Invalid status on Data Center Default. Setting status to Non Responsive.

But the whole cluster is working normally during that time.

I believe that I somehow have a network issue on my side, but I have no
clue what kind of check is causing the network status to be penalized.

Does anyone have an idea how to investigate this further?

Please check also broker.log. Do you see 'dig' failures?

Yes I found them as well.

Thread-1::WARNING::2021-07-19
08:02:00,032::network::120::network.Network::(_dns) DNS query failed:
; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
;; global options: +cmd
;; connection timed out; no servers could be reached


This happened several times already on our CI infrastructure, but yours is
the first report from an actual real user. See also:

https://lists.ovirt.org/archives/list/in...@ovirt.org/thread/LIGS5WXGEKWACY5GCK7Z6Q2JYVWJ6JBF/

So I understand that the following command is triggered to test the
network: "dig +tries=1 +time=5"

Indeed.


I didn't open a bug for this (yet?), also because I never reproduced on my
own machines and am not sure about the exact failing flow. If this is
reproducible
reliably for you, you might want to test the patch I pushed:

https://gerrit.ovirt.org/c/ovirt-hosted-engine-ha/+/115596

Now filed this bug and linked to it in the above patch. Thanks for your report!

https://bugzilla.redhat.com/show_bug.cgi?id=1984356

Perfect, I added myself to CC as well.

I have implemented the change on one of my nodes, restarted the
ovirt-ha-broker and moved the engine to that node.
Since then the issue has not occurred. I guess I will leave it running
until the end of the week and then move the engine back to an unchanged
node to see whether the issue comes back.


Best regards,


I'm happy to give it a try.
Please confirm that I need to replace this file (network.py) on all my
nodes (CentOS 8.4 based) which can host my engine.

It definitely makes sense to do so, but in principle there is no problem
with applying it only on some of them. That's especially useful if you try
this first on a test env and try to enforce a reproduction somehow (overload
the network, disconnect stuff, etc.).

OK will give it a try and report back.

Thanks and good luck.

Do I need to restart anything after that change?

Yes, the broker. This might restart some other services there, so it is
best to put the host into maintenance while doing this.


Also, please confirm that the comma after TCP is correct, as there was
none after the timeout in row 110 before.

It is correct, but not mandatory. We (my team, at least) often add it
in such cases so that a theoretical future patch that adds another
parameter does not need to add it again (thus making the patch smaller
and hopefully cleaner).


Other ideas/opinions about how to enhance this part of the monitoring
are most welcome.

If this phenomenon is new for you, and you can reliably say it's not due to
a recent "natural" higher network load, I wonder if it's due to some weird
bug/change somewhere.

I'm quite sure that I have seen this since we moved to 4.4.(4).
Just for housekeeping: I'm running 4.4.7 now.

We have used 'dig' as the network monitor since 4.3.5, around one year
before 4.4 was released: https://bugzilla.redhat.com/1659052

Which version did you use before 4.4?

The last 4.3 versions we ran were 4.3.7, 4.3.9 and 4.3.10, before
migrating to 4.4.4.

I now realize that in the above-linked bug we only changed the default
for new setups. So if you deployed HE before 4.3.5, upgrading to a later
4.3 would not change the default (as opposed to upgrading to 4.4, which
was actually a new deployment with engine backup/restore). Do you know
which version your cluster was originally deployed with?

Hm, I'm sorry but I don't recall this. I'm quite sure that we started

OK, thanks for trying.


with 4.0 something. But we moved to a HE setup ar

[ovirt-users] Re: ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 1600 due to network status

2021-07-21 Thread Yedidyah Bar David
On Mon, Jul 19, 2021 at 2:20 PM Yedidyah Bar David  wrote:
>
> On Mon, Jul 19, 2021 at 1:54 PM Christoph Timm  wrote:
> >
> >
> >
> > On 19.07.21 at 10:52, Yedidyah Bar David wrote:
> > > On Mon, Jul 19, 2021 at 11:39 AM Christoph Timm  wrote:
> > >>
> > >> On 19.07.21 at 10:25, Yedidyah Bar David wrote:
> > >>> On Mon, Jul 19, 2021 at 11:02 AM Christoph Timm  wrote:
> >  On 19.07.21 at 09:27, Yedidyah Bar David wrote:
> > > On Mon, Jul 19, 2021 at 10:04 AM Christoph Timm  
> > > wrote:
> > >> Hi Didi,
> > >>
> > >> thank you for the quick response.
> > >>
> > >>
> > >> On 19.07.21 at 07:59, Yedidyah Bar David wrote:
> > >>> On Mon, Jul 19, 2021 at 8:39 AM Christoph Timm  
> > >>> wrote:
> >  Hi List,
> > 
> >  I'm trying to understand why my hosted engine is moved from one
> >  node to another from time to time.
> >  It sometimes happens multiple times a day, but there are also
> >  days without it.
> > 
> >  I can see the following in the ovirt-hosted-engine-ha/agent.log:
> >  ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
> >  Penalizing score by 1600 due to network status
> > 
> >  After that the engine is shut down and started on another host.
> >  The oVirt Admin portal shows the following around the same time:
> >  Invalid status on Data Center Default. Setting status to Non
> >  Responsive.
> > 
> >  But the whole cluster is working normally during that time.
> > 
> >  I believe that I somehow have a network issue on my side, but I
> >  have no clue what kind of check is causing the network status to
> >  be penalized.
> > 
> >  Does anyone have an idea how to investigate this further?
> > >>> Please check also broker.log. Do you see 'dig' failures?
> > >> Yes I found them as well.
> > >>
> > >> Thread-1::WARNING::2021-07-19
> > >> 08:02:00,032::network::120::network.Network::(_dns) DNS query failed:
> > >> ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
> > >> ;; global options: +cmd
> > >> ;; connection timed out; no servers could be reached
> > >>
> > >>> This happened several times already on our CI infrastructure, but 
> > >>> yours is
> > >>> the first report from an actual real user. See also:
> > >>>
> > >>> https://lists.ovirt.org/archives/list/in...@ovirt.org/thread/LIGS5WXGEKWACY5GCK7Z6Q2JYVWJ6JBF/
> > >> So I understand that the following command is triggered to test the
> > >> network: "dig +tries=1 +time=5"
> > > Indeed.
> > >
> > >>> I didn't open a bug for this (yet?), also because I never 
> > >>> reproduced on my
> > >>> own machines and am not sure about the exact failing flow. If this 
> > >>> is
> > >>> reproducible
> > >>> reliably for you, you might want to test the patch I pushed:
> > >>>
> > >>> https://gerrit.ovirt.org/c/ovirt-hosted-engine-ha/+/115596

I have now filed this bug and linked to it in the above patch. Thanks for your report!

https://bugzilla.redhat.com/show_bug.cgi?id=1984356

Best regards,

> > >> I'm happy to give it a try.
> > >> Please confirm that I need to replace this file (network.py) on all 
> > >> my
> > >> nodes (CentOS 8.4 based) which can host my engine.
> > > It definitely makes sense to do so, but in principle there is no 
> > > problem
> > > with applying it only on some of them. That's especially useful if 
> > > you try
> > > this first on a test env and try to enforce a reproduction somehow 
> > > (overload
> > > the network, disconnect stuff, etc.).
> >  OK will give it a try and report back.
> > >>> Thanks and good luck.
> > Do I need to restart anything after that change?
>
> Yes, the broker. This might restart some other services there, so it is
> best to put the host into maintenance while doing this.
>
> > Also, please confirm that the comma after TCP is correct, as there was
> > none after the timeout in row 110 before.
>
> It is correct, but not mandatory. We (my team, at least) often add it
> in such cases so that a theoretical future patch that adds another
> parameter does not need to add it again (thus making the patch smaller
> and hopefully cleaner).
>
> > >>>
> > >>> Other ideas/opinions about how to enhance this part of the 
> > >>> monitoring
> > >>> are most welcome.
> > >>>
> > >>> If this phenomenon is new for you, and you can reliably say it's 
> > >>> not due to
> > >>> a recent "natural" higher network load, I wonder if it's due to 
> > >>> some weird
> > >>> bug/change somewhere.
> > >> I'm quite sure that I have seen this since we moved to 4.4.(4).
> > >> Just for housekeeping: I'm running 4.4.7 now.
> > > We use 'dig' as the network 

[ovirt-users] Re: ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 1600 due to network status

2021-07-19 Thread Yedidyah Bar David
On Mon, Jul 19, 2021 at 1:54 PM Christoph Timm  wrote:
>
>
>
> On 19.07.21 at 10:52, Yedidyah Bar David wrote:
> > On Mon, Jul 19, 2021 at 11:39 AM Christoph Timm  wrote:
> >>
> >> On 19.07.21 at 10:25, Yedidyah Bar David wrote:
> >>> On Mon, Jul 19, 2021 at 11:02 AM Christoph Timm  wrote:
>  On 19.07.21 at 09:27, Yedidyah Bar David wrote:
> > On Mon, Jul 19, 2021 at 10:04 AM Christoph Timm  wrote:
> >> Hi Didi,
> >>
> >> thank you for the quick response.
> >>
> >>
> >> On 19.07.21 at 07:59, Yedidyah Bar David wrote:
> >>> On Mon, Jul 19, 2021 at 8:39 AM Christoph Timm  
> >>> wrote:
>  Hi List,
> 
>  I'm trying to understand why my hosted engine is moved from one node
>  to another from time to time.
>  It sometimes happens multiple times a day, but there are also
>  days without it.
> 
>  I can see the following in the ovirt-hosted-engine-ha/agent.log:
>  ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
>  Penalizing score by 1600 due to network status
> 
>  After that the engine is shut down and started on another host.
>  The oVirt Admin portal shows the following around the same time:
>  Invalid status on Data Center Default. Setting status to Non
>  Responsive.
> 
>  But the whole cluster is working normally during that time.
> 
>  I believe that I somehow have a network issue on my side, but I have
>  no clue what kind of check is causing the network status to be
>  penalized.
> 
>  Does anyone have an idea how to investigate this further?
> >>> Please check also broker.log. Do you see 'dig' failures?
> >> Yes I found them as well.
> >>
> >> Thread-1::WARNING::2021-07-19
> >> 08:02:00,032::network::120::network.Network::(_dns) DNS query failed:
> >> ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
> >> ;; global options: +cmd
> >> ;; connection timed out; no servers could be reached
> >>
> >>> This happened several times already on our CI infrastructure, but 
> >>> yours is
> >>> the first report from an actual real user. See also:
> >>>
> >>> https://lists.ovirt.org/archives/list/in...@ovirt.org/thread/LIGS5WXGEKWACY5GCK7Z6Q2JYVWJ6JBF/
> >> So I understand that the following command is triggered to test the
> >> network: "dig +tries=1 +time=5"
> > Indeed.
> >
> >>> I didn't open a bug for this (yet?), also because I never reproduced 
> >>> on my
> >>> own machines and am not sure about the exact failing flow. If this is
> >>> reproducible
> >>> reliably for you, you might want to test the patch I pushed:
> >>>
> >>> https://gerrit.ovirt.org/c/ovirt-hosted-engine-ha/+/115596
> >> I'm happy to give it a try.
> >> Please confirm that I need to replace this file (network.py) on all my
> >> nodes (CentOS 8.4 based) which can host my engine.
> > It definitely makes sense to do so, but in principle there is no problem
> > with applying it only on some of them. That's especially useful if you 
> > try
> > this first on a test env and try to enforce a reproduction somehow 
> > (overload
> > the network, disconnect stuff, etc.).
>  OK will give it a try and report back.
> >>> Thanks and good luck.
> Do I need to restart anything after that change?

Yes, the broker. This might restart some other services there, so it is
best to put the host into maintenance while doing this.

> Also please confirm that the comma after TCP is correct as there wasn't
> one before after the timeout in row 110.

It is correct, but not mandatory. We (my team, at least) often add it
in such cases so that a theoretical future patch that adds another
parameter does not need to add it again (thus making the patch smaller
and hopefully cleaner).
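
The trailing-comma convention can be illustrated with a small Python sketch. The argument lists below are hypothetical and are not the actual code from network.py; they only assume, as this thread implies, that the patch adds dig's +tcp option after the timeout option.

```python
# Hypothetical argument lists, NOT the real network.py code.
# Before the patch: the last element has no trailing comma.
DIG_ARGS_BEFORE = [
    "dig",
    "+tries=1",
    "+time=5"
]

# After the patch: "+tcp" is added together with a trailing comma.
# Python ignores the comma, but a later patch that appends another
# option then only has to add one line instead of editing two.
DIG_ARGS_AFTER = [
    "dig",
    "+tries=1",
    "+time=5",
    "+tcp",
]


def added_options(before, after):
    """Return the options present in `after` but not in `before`."""
    return [opt for opt in after if opt not in before]


print(added_options(DIG_ARGS_BEFORE, DIG_ARGS_AFTER))
```

Both lists are valid Python; the trailing comma only affects how the next diff looks.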

> >>>
> >>> Other ideas/opinions about how to enhance this part of the monitoring
> >>> are most welcome.
> >>>
> >>> If this phenomenon is new for you, and you can reliably say it's not 
> >>> due to
> >>> a recent "natural" higher network load, I wonder if it's due to some 
> >>> weird
> >>> bug/change somewhere.
> >> I'm quite sure that I have seen this since we moved to 4.4.(4).
> >> Just for housekeeping: I'm running 4.4.7 now.
> > We have used 'dig' as the network monitor since 4.3.5, around one year
> > before 4.4 was released: https://bugzilla.redhat.com/1659052
> >
> > Which version did you use before 4.4?
>  The last 4.3 versions we ran were 4.3.7, 4.3.9 and 4.3.10, before
>  migrating to 4.4.4.
> >>> I now realize that in above-linked bug we only changed the default, for 
> >>> new
> >>> setups. So if you deployed He before 4.3.5, upgrade to later 4.3 would not
> >>> change the default (as opposed to upgrade to 4.4, w

[ovirt-users] Re: ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 1600 due to network status

2021-07-19 Thread Christoph Timm



On 19.07.21 at 10:52, Yedidyah Bar David wrote:

On Mon, Jul 19, 2021 at 11:39 AM Christoph Timm  wrote:


On 19.07.21 at 10:25, Yedidyah Bar David wrote:

On Mon, Jul 19, 2021 at 11:02 AM Christoph Timm  wrote:

On 19.07.21 at 09:27, Yedidyah Bar David wrote:

On Mon, Jul 19, 2021 at 10:04 AM Christoph Timm  wrote:

Hi Didi,

thank you for the quick response.


On 19.07.21 at 07:59, Yedidyah Bar David wrote:

On Mon, Jul 19, 2021 at 8:39 AM Christoph Timm  wrote:

Hi List,

I'm trying to understand why my hosted engine is moved from one node to
another from time to time.
It sometimes happens multiple times a day, but there are also days
without it.

I can see the following in the ovirt-hosted-engine-ha/agent.log:
ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
Penalizing score by 1600 due to network status

After that the engine is shut down and started on another host.
The oVirt Admin portal shows the following around the same time:
Invalid status on Data Center Default. Setting status to Non Responsive.

But the whole cluster is working normally during that time.

I believe that I somehow have a network issue on my side, but I have no
clue what kind of check is causing the network status to be penalized.

Does anyone have an idea how to investigate this further?

Please check also broker.log. Do you see 'dig' failures?

Yes I found them as well.

Thread-1::WARNING::2021-07-19
08:02:00,032::network::120::network.Network::(_dns) DNS query failed:
; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
;; global options: +cmd
;; connection timed out; no servers could be reached
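
To see how often the check fails, entries like the one above can be counted with a short script. This is only a sketch: the pattern is derived from the single broker.log line quoted in this thread, so other versions may format the line differently, and the function name and sample text are made up for illustration.

```python
import re

# Pattern derived from the one broker.log line quoted in this thread
# (an assumption about the general log format). \s+ tolerates the line
# wrapping introduced by the mail client.
DNS_FAILURE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<time>\d{2}:\d{2}:\d{2}),\d+"
    r"::network::\d+::network\.Network::\(_dns\) DNS query\s+failed"
)


def find_dns_failures(log_text):
    """Return (date, time) tuples for every 'DNS query failed' entry."""
    return [(m.group("date"), m.group("time"))
            for m in DNS_FAILURE.finditer(log_text)]


# Sample text copied from the excerpt in this thread.
SAMPLE = (
    "Thread-1::WARNING::2021-07-19\n"
    "08:02:00,032::network::120::network.Network::(_dns) DNS query failed:\n"
    "; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5\n"
    ";; connection timed out; no servers could be reached\n"
)

print(find_dns_failures(SAMPLE))
```

Comparing the returned timestamps with the times at which the engine was moved would show whether each migration lines up with a failed DNS check.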


This happened several times already on our CI infrastructure, but yours is
the first report from an actual real user. See also:

https://lists.ovirt.org/archives/list/in...@ovirt.org/thread/LIGS5WXGEKWACY5GCK7Z6Q2JYVWJ6JBF/

So I understand that the following command is triggered to test the
network: "dig +tries=1 +time=5"

Indeed.


I didn't open a bug for this (yet?), also because I never reproduced on my
own machines and am not sure about the exact failing flow. If this is
reproducible
reliably for you, you might want to test the patch I pushed:

https://gerrit.ovirt.org/c/ovirt-hosted-engine-ha/+/115596

I'm happy to give it a try.
Please confirm that I need to replace this file (network.py) on all my
nodes (CentOS 8.4 based) which can host my engine.

It definitely makes sense to do so, but in principle there is no problem
with applying it only on some of them. That's especially useful if you try
this first on a test env and try to enforce a reproduction somehow (overload
the network, disconnect stuff, etc.).

OK will give it a try and report back.

Thanks and good luck.

Do I need to restart anything after that change?
Also, please confirm that the comma after TCP is correct, as there was
none after the timeout in row 110 before.



Other ideas/opinions about how to enhance this part of the monitoring
are most welcome.

If this phenomenon is new for you, and you can reliably say it's not due to
a recent "natural" higher network load, I wonder if it's due to some weird
bug/change somewhere.

I'm quite sure that I have seen this since we moved to 4.4.(4).
Just for housekeeping: I'm running 4.4.7 now.

We have used 'dig' as the network monitor since 4.3.5, around one year
before 4.4 was released: https://bugzilla.redhat.com/1659052

Which version did you use before 4.4?

The last 4.3 versions we ran were 4.3.7, 4.3.9 and 4.3.10, before
migrating to 4.4.4.

I now realize that in the above-linked bug we only changed the default
for new setups. So if you deployed HE before 4.3.5, upgrading to a later
4.3 would not change the default (as opposed to upgrading to 4.4, which
was actually a new deployment with engine backup/restore). Do you know
which version your cluster was originally deployed with?

Hm, I'm sorry but I don't recall this. I'm quite sure that we started

OK, thanks for trying.


with 4.0 something. But we moved to a HE setup around September 2019.
But I don't recall the version. But we installed also the backup from
the old installation into the HE environment if I'm not wrong.

If indeed this change was the trigger for you, you can rather easily try to
change this to 'ping' and see if this helps - I think it's enough to change
'network_test' to 'ping' in /etc/ovirt-hosted-engine/hosted-engine.conf
and restart the broker - didn't try, though. But generally speaking, I do not
think we want to change the default back to 'ping', but rather make 'dns'
work better/well. We had valid reasons to move away from ping...
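
The suggested change could be sketched as follows. This is an untested illustration that only works on the file's text: it assumes hosted-engine.conf uses simple key=value lines and that the key is spelled network_test as written above. The helper name and the sample content are made up.

```python
def set_network_test(conf_text, value="ping"):
    """Return conf_text with the network_test line set to `value`.

    Assumes simple key=value lines (an assumption about the file format).
    The key is appended if it is not present yet.
    """
    lines = conf_text.splitlines()
    found = False
    for i, line in enumerate(lines):
        key, sep, _ = line.partition("=")
        if sep and key.strip() == "network_test":
            lines[i] = "network_test=" + value
            found = True
    if not found:
        lines.append("network_test=" + value)
    return "\n".join(lines) + "\n"


# Made-up sample content, not a real hosted-engine.conf.
SAMPLE_CONF = "gateway=192.0.2.1\nnetwork_test=dns\n"

print(set_network_test(SAMPLE_CONF), end="")
```

After the edited text is written back to the file, the broker still has to be restarted for the change to take effect, as described above.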

OK, I will try this if the TCP change does not help.


Best regards,


[ovirt-users] Re: ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 1600 due to network status

2021-07-19 Thread Yedidyah Bar David
On Mon, Jul 19, 2021 at 11:39 AM Christoph Timm  wrote:
>
>
> On 19.07.21 at 10:25, Yedidyah Bar David wrote:
> > On Mon, Jul 19, 2021 at 11:02 AM Christoph Timm  wrote:
> >>
> >> On 19.07.21 at 09:27, Yedidyah Bar David wrote:
> >>> On Mon, Jul 19, 2021 at 10:04 AM Christoph Timm  wrote:
>  Hi Didi,
> 
>  thank you for the quick response.
> 
> 
>  On 19.07.21 at 07:59, Yedidyah Bar David wrote:
> > On Mon, Jul 19, 2021 at 8:39 AM Christoph Timm  wrote:
> >> Hi List,
> >>
> >> I'm trying to understand why my hosted engine is moved from one node to
> >> another from time to time.
> >> It sometimes happens multiple times a day, but there are also days
> >> without it.
> >>
> >> I can see the following in the ovirt-hosted-engine-ha/agent.log:
> >> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
> >> Penalizing score by 1600 due to network status
> >>
> >> After that the engine is shut down and started on another host.
> >> The oVirt Admin portal shows the following around the same time:
> >> Invalid status on Data Center Default. Setting status to Non
> >> Responsive.
> >>
> >> But the whole cluster is working normally during that time.
> >>
> >> I believe that I somehow have a network issue on my side, but I have no
> >> clue what kind of check is causing the network status to be penalized.
> >>
> >> Does anyone have an idea how to investigate this further?
> > Please check also broker.log. Do you see 'dig' failures?
>  Yes I found them as well.
> 
>  Thread-1::WARNING::2021-07-19
>  08:02:00,032::network::120::network.Network::(_dns) DNS query failed:
>  ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
>  ;; global options: +cmd
>  ;; connection timed out; no servers could be reached
> 
> > This happened several times already on our CI infrastructure, but yours 
> > is
> > the first report from an actual real user. See also:
> >
> > https://lists.ovirt.org/archives/list/in...@ovirt.org/thread/LIGS5WXGEKWACY5GCK7Z6Q2JYVWJ6JBF/
>  So I understand that the following command is triggered to test the
>  network: "dig +tries=1 +time=5"
> >>> Indeed.
> >>>
> > I didn't open a bug for this (yet?), also because I never reproduced on 
> > my
> > own machines and am not sure about the exact failing flow. If this is
> > reproducible
> > reliably for you, you might want to test the patch I pushed:
> >
> > https://gerrit.ovirt.org/c/ovirt-hosted-engine-ha/+/115596
>  I'm happy to give it a try.
>  Please confirm that I need to replace this file (network.py) on all my
>  nodes (CentOS 8.4 based) which can host my engine.
> >>> It definitely makes sense to do so, but in principle there is no problem
> >>> with applying it only on some of them. That's especially useful if you try
> >>> this first on a test env and try to enforce a reproduction somehow
> >>> (overload the network, disconnect stuff, etc.).
> >> OK will give it a try and report back.
> > Thanks and good luck.
> >
> > Other ideas/opinions about how to enhance this part of the monitoring
> > are most welcome.
> >
> > If this phenomenon is new for you, and you can reliably say it's not due to
> > a recent "natural" higher network load, I wonder if it's due to some weird
> > bug/change somewhere.
>  I'm quite sure that I see this since we moved to 4.4.(4).
>  Just for house keeping I'm running 4.4.7 now.
> >>> We use 'dig' as the network monitor since 4.3.5, around one year before 4.4
> >>> was released: https://bugzilla.redhat.com/1659052
> >>>
> >>> Which version did you use before 4.4?
> >> The last 4.3 versions have been 4.3.7, 4.3.9 and 4.3.10 before migrating
> >> to 4.4.4.
> > I now realize that in above-linked bug we only changed the default, for new
> > setups. So if you deployed HE before 4.3.5, upgrade to later 4.3 would not
> > change the default (as opposed to upgrade to 4.4, which was actually a
> > new deployment with engine backup/restore). Do you know which version
> > your cluster was originally deployed with?
> Hm, I'm sorry but I don't recall this. I'm quite sure that we started

OK, thanks for trying.

> with 4.0 something. But we moved to a HE setup around September 2019.
> But I don't recall the version. But we installed also the backup from
> the old installation into the HE environment if I'm not wrong.

If indeed this change was the trigger for you, you can rather easily try to
change this to 'ping' and see if this helps - I think it's enough to change
'network_test' to 'ping' in /etc/ovirt-hosted-engine/hosted-engine.conf
and restart the broker - didn't try, though. But generally speaking, I do not
think we want to change the default back to 'ping', but rather make 'dns'
work better/well. We had valid reasons to mo
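
As a rough illustration of the 'network_test' change suggested above, the following Python sketch reads config text and rewrites that single setting. It assumes that hosted-engine.conf consists of plain key=value lines; the helper names and the sample content are hypothetical, and the change itself is as untested here as it is in the message above.

```python
from typing import Dict


def parse_he_conf(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def set_network_test(text: str, method: str) -> str:
    """Return the config text with network_test set to `method`."""
    out = []
    seen = False
    for raw in text.splitlines():
        if raw.strip().startswith("network_test="):
            out.append(f"network_test={method}")
            seen = True
        else:
            out.append(raw)
    if not seen:
        # Assumption: appending the key is acceptable when it is absent.
        out.append(f"network_test={method}")
    return "\n".join(out) + "\n"


# Hypothetical sample content; a real hosted-engine.conf has many more keys.
SAMPLE = "gateway=192.0.2.1\nnetwork_test=dns\n"

if __name__ == "__main__":
    print(parse_he_conf(SAMPLE)["network_test"])
    print(set_network_test(SAMPLE, "ping"), end="")
```

Keeping a copy of the original file before writing the result back, and restarting the broker afterwards, would be the cautious way to try this on a test host.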

[ovirt-users] Re: ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 1600 due to network status

2021-07-19 Thread Christoph Timm


On 19.07.21 at 10:25, Yedidyah Bar David wrote:

On Mon, Jul 19, 2021 at 11:02 AM Christoph Timm  wrote:


On 19.07.21 at 09:27, Yedidyah Bar David wrote:

On Mon, Jul 19, 2021 at 10:04 AM Christoph Timm  wrote:

Hi Didi,

thank you for the quick response.


On 19.07.21 at 07:59, Yedidyah Bar David wrote:

On Mon, Jul 19, 2021 at 8:39 AM Christoph Timm  wrote:

Hi List,

I'm trying to understand why my hosted engine is moved from one node to
another from time to time.
It is happening sometimes multiple times a day. But there are also days
without it.

I can see the following in the ovirt-hosted-engine-ha/agent.log:
ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
Penalizing score by 1600 due to network status

After that the engine will be shut down and started on another host.
The oVirt Admin portal is showing the following around the same time:
Invalid status on Data Center Default. Setting status to Non Responsive.

But the whole cluster is working normally during that time.

I believe that I have somehow a network issue on my side but I have no
clue what kind of check is causing the network status to be penalized.

Does anyone have an idea how to investigate this further?

Please check also broker.log. Do you see 'dig' failures?

Yes I found them as well.

Thread-1::WARNING::2021-07-19
08:02:00,032::network::120::network.Network::(_dns) DNS query failed:
; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
;; global options: +cmd
;; connection timed out; no servers could be reached


This happened several times already on our CI infrastructure, but yours is
the first report from an actual real user. See also:

https://lists.ovirt.org/archives/list/in...@ovirt.org/thread/LIGS5WXGEKWACY5GCK7Z6Q2JYVWJ6JBF/

So I understand that the following command is triggered to test the
network: "dig +tries=1 +time=5"

Indeed.


I didn't open a bug for this (yet?), also because I never reproduced on my
own machines and am not sure about the exact failing flow. If this is
reproducible reliably for you, you might want to test the patch I pushed:

https://gerrit.ovirt.org/c/ovirt-hosted-engine-ha/+/115596

I'm happy to give it a try.
Please confirm that I need to replace this file (network.py) on all my
nodes (CentOS 8.4 based) which can host my engine.

It definitely makes sense to do so, but in principle there is no problem
with applying it only on some of them. That's especially useful if you try
this first on a test env and try to enforce a reproduction somehow (overload
the network, disconnect stuff, etc.).

OK will give it a try and report back.

Thanks and good luck.


Other ideas/opinions about how to enhance this part of the monitoring
are most welcome.

If this phenomenon is new for you, and you can reliably say it's not due to
a recent "natural" higher network load, I wonder if it's due to some weird
bug/change somewhere.

I'm quite sure that I see this since we moved to 4.4.(4).
Just for house keeping I'm running 4.4.7 now.

We use 'dig' as the network monitor since 4.3.5, around one year before 4.4
was released: https://bugzilla.redhat.com/1659052

Which version did you use before 4.4?

The last 4.3 versions have been 4.3.7, 4.3.9 and 4.3.10 before migrating
to 4.4.4.

I now realize that in above-linked bug we only changed the default, for new
setups. So if you deployed HE before 4.3.5, upgrade to later 4.3 would not
change the default (as opposed to upgrade to 4.4, which was actually a
new deployment with engine backup/restore). Do you know which version
your cluster was originally deployed with?
Hm, I'm sorry, but I don't recall this. I'm quite sure that we started
with 4.0 something, but we moved to an HE setup around September 2019 and
I don't recall which version that was. We also restored the backup from
the old installation into the HE environment, if I'm not wrong.


Best regards,

___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/RWZ76D2OZ4ZXEMEOWZVQ75IZHMJP2V6D/


[ovirt-users] Re: ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 1600 due to network status

2021-07-19 Thread Yedidyah Bar David
On Mon, Jul 19, 2021 at 11:02 AM Christoph Timm  wrote:
>
>
> On 19.07.21 at 09:27, Yedidyah Bar David wrote:
> > On Mon, Jul 19, 2021 at 10:04 AM Christoph Timm  wrote:
> >> Hi Didi,
> >>
> >> thank you for the quick response.
> >>
> >>
> >> On 19.07.21 at 07:59, Yedidyah Bar David wrote:
> >>> On Mon, Jul 19, 2021 at 8:39 AM Christoph Timm  wrote:
>  Hi List,
> 
>  I'm trying to understand why my hosted engine is moved from one node to
>  another from time to time.
>  It is happening sometimes multiple times a day. But there are also days
>  without it.
> 
>  I can see the following in the ovirt-hosted-engine-ha/agent.log:
>  ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
>  Penalizing score by 1600 due to network status
> 
>  After that the engine will be shut down and started on another host.
>  The oVirt Admin portal is showing the following around the same time:
>  Invalid status on Data Center Default. Setting status to Non Responsive.
> 
>  But the whole cluster is working normally during that time.
> 
>  I believe that I have somehow a network issue on my side but I have no
>  clue what kind of check is causing the network status to be penalized.
> 
>  Does anyone have an idea how to investigate this further?
> >>> Please check also broker.log. Do you see 'dig' failures?
> >> Yes I found them as well.
> >>
> >> Thread-1::WARNING::2021-07-19
> >> 08:02:00,032::network::120::network.Network::(_dns) DNS query failed:
> >> ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
> >> ;; global options: +cmd
> >> ;; connection timed out; no servers could be reached
> >>
> >>> This happened several times already on our CI infrastructure, but yours is
> >>> the first report from an actual real user. See also:
> >>>
> >>> https://lists.ovirt.org/archives/list/in...@ovirt.org/thread/LIGS5WXGEKWACY5GCK7Z6Q2JYVWJ6JBF/
> >> So I understand that the following command is triggered to test the
> >> network: "dig +tries=1 +time=5"
> > Indeed.
> >
> >>> I didn't open a bug for this (yet?), also because I never reproduced on my
> >>> own machines and am not sure about the exact failing flow. If this is
> >>> reproducible reliably for you, you might want to test the patch I pushed:
> >>>
> >>> https://gerrit.ovirt.org/c/ovirt-hosted-engine-ha/+/115596
> >> I'm happy to give it a try.
> >> Please confirm that I need to replace this file (network.py) on all my
> >> nodes (CentOS 8.4 based) which can host my engine.
> > It definitely makes sense to do so, but in principle there is no problem
> > with applying it only on some of them. That's especially useful if you try
> > this first on a test env and try to enforce a reproduction somehow (overload
> > the network, disconnect stuff, etc.).
> OK will give it a try and report back.

Thanks and good luck.

> >
> >>> Other ideas/opinions about how to enhance this part of the monitoring
> >>> are most welcome.
> >>>
> >>> If this phenomenon is new for you, and you can reliably say it's not due to
> >>> a recent "natural" higher network load, I wonder if it's due to some weird
> >>> bug/change somewhere.
> >> I'm quite sure that I see this since we moved to 4.4.(4).
> >> Just for house keeping I'm running 4.4.7 now.
> > We use 'dig' as the network monitor since 4.3.5, around one year before 4.4
> > was released: https://bugzilla.redhat.com/1659052
> >
> > Which version did you use before 4.4?
> The last 4.3 versions have been 4.3.7, 4.3.9 and 4.3.10 before migrating
> to 4.4.4.

I now realize that in above-linked bug we only changed the default, for new
setups. So if you deployed HE before 4.3.5, upgrade to later 4.3 would not
change the default (as opposed to upgrade to 4.4, which was actually a
new deployment with engine backup/restore). Do you know which version
your cluster was originally deployed with?

Best regards,
-- 
Didi
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/6X2DMNPAXCD34624CMBEZTZO4KU64KCG/


[ovirt-users] Re: ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 1600 due to network status

2021-07-19 Thread Christoph Timm


On 19.07.21 at 09:27, Yedidyah Bar David wrote:

On Mon, Jul 19, 2021 at 10:04 AM Christoph Timm  wrote:

Hi Didi,

thank you for the quick response.


On 19.07.21 at 07:59, Yedidyah Bar David wrote:

On Mon, Jul 19, 2021 at 8:39 AM Christoph Timm  wrote:

Hi List,

I'm trying to understand why my hosted engine is moved from one node to
another from time to time.
It is happening sometimes multiple times a day. But there are also days
without it.

I can see the following in the ovirt-hosted-engine-ha/agent.log:
ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
Penalizing score by 1600 due to network status

After that the engine will be shut down and started on another host.
The oVirt Admin portal is showing the following around the same time:
Invalid status on Data Center Default. Setting status to Non Responsive.

But the whole cluster is working normally during that time.

I believe that I have somehow a network issue on my side but I have no
clue what kind of check is causing the network status to be penalized.

Does anyone have an idea how to investigate this further?

Please check also broker.log. Do you see 'dig' failures?

Yes I found them as well.

Thread-1::WARNING::2021-07-19
08:02:00,032::network::120::network.Network::(_dns) DNS query failed:
; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
;; global options: +cmd
;; connection timed out; no servers could be reached


This happened several times already on our CI infrastructure, but yours is
the first report from an actual real user. See also:

https://lists.ovirt.org/archives/list/in...@ovirt.org/thread/LIGS5WXGEKWACY5GCK7Z6Q2JYVWJ6JBF/

So I understand that the following command is triggered to test the
network: "dig +tries=1 +time=5"

Indeed.


I didn't open a bug for this (yet?), also because I never reproduced on my
own machines and am not sure about the exact failing flow. If this is
reproducible reliably for you, you might want to test the patch I pushed:

https://gerrit.ovirt.org/c/ovirt-hosted-engine-ha/+/115596

I'm happy to give it a try.
Please confirm that I need to replace this file (network.py) on all my
nodes (CentOS 8.4 based) which can host my engine.

It definitely makes sense to do so, but in principle there is no problem
with applying it only on some of them. That's especially useful if you try
this first on a test env and try to enforce a reproduction somehow (overload
the network, disconnect stuff, etc.).

OK will give it a try and report back.



Other ideas/opinions about how to enhance this part of the monitoring
are most welcome.

If this phenomenon is new for you, and you can reliably say it's not due to
a recent "natural" higher network load, I wonder if it's due to some weird
bug/change somewhere.

I'm quite sure that I see this since we moved to 4.4.(4).
Just for house keeping I'm running 4.4.7 now.

We use 'dig' as the network monitor since 4.3.5, around one year before 4.4
was released: https://bugzilla.redhat.com/1659052

Which version did you use before 4.4?
The last 4.3 versions were 4.3.7, 4.3.9 and 4.3.10, before migrating
to 4.4.4.


Best regards,

List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/FLU4ULXUXBUFCQV237LLX3OBGYBTEW6Q/


[ovirt-users] Re: ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 1600 due to network status

2021-07-19 Thread Yedidyah Bar David
On Mon, Jul 19, 2021 at 10:04 AM Christoph Timm  wrote:
>
> Hi Didi,
>
> thank you for the quick response.
>
>
> On 19.07.21 at 07:59, Yedidyah Bar David wrote:
> > On Mon, Jul 19, 2021 at 8:39 AM Christoph Timm  wrote:
> >> Hi List,
> >>
> >> I'm trying to understand why my hosted engine is moved from one node to
> >> another from time to time.
> >> It is happening sometimes multiple times a day. But there are also days
> >> without it.
> >>
> >> I can see the following in the ovirt-hosted-engine-ha/agent.log:
> >> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
> >> Penalizing score by 1600 due to network status
> >>
> >> After that the engine will be shut down and started on another host.
> >> The oVirt Admin portal is showing the following around the same time:
> >> Invalid status on Data Center Default. Setting status to Non Responsive.
> >>
> >> But the whole cluster is working normally during that time.
> >>
> >> I believe that I have somehow a network issue on my side but I have no
> >> clue what kind of check is causing the network status to be penalized.
> >>
> >> Does anyone have an idea how to investigate this further?
> > Please check also broker.log. Do you see 'dig' failures?
> Yes I found them as well.
>
> Thread-1::WARNING::2021-07-19
> 08:02:00,032::network::120::network.Network::(_dns) DNS query failed:
> ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
> ;; global options: +cmd
> ;; connection timed out; no servers could be reached
>
> >
> > This happened several times already on our CI infrastructure, but yours is
> > the first report from an actual real user. See also:
> >
> > https://lists.ovirt.org/archives/list/in...@ovirt.org/thread/LIGS5WXGEKWACY5GCK7Z6Q2JYVWJ6JBF/
> So I understand that the following command is triggered to test the
> network: "dig +tries=1 +time=5"

Indeed.
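
For anyone who wants to repeat the check by hand, here is a minimal Python sketch of a probe built around the same command. The options are taken from the broker.log output quoted above; treating any non-zero exit status as a failure, the 10-second outer timeout, and the helper names are assumptions of this sketch and not necessarily what the broker itself does.

```python
import subprocess
from typing import Callable, List, Sequence

# Options as they appear in the quoted broker.log output.
DIG_OPTIONS = ("+tries=1", "+time=5")


def build_dig_command(options: Sequence[str] = DIG_OPTIONS) -> List[str]:
    """Build the argument list for a bare dig call with the given options."""
    return ["dig", *options]


def dns_probe(
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Return True if dig exits with status 0, False otherwise.

    `run` is injectable so the logic can be exercised without a network.
    """
    try:
        result = run(
            build_dig_command(), capture_output=True, text=True, timeout=10
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # dig hung past the outer timeout, or is not installed.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("DNS probe ok:", dns_probe())
```

Running this in a loop on an affected host, with timestamps, may help show whether the failures line up with the times the engine was moved.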

> >
> > I didn't open a bug for this (yet?), also because I never reproduced on my
> > own machines and am not sure about the exact failing flow. If this is
> > reproducible reliably for you, you might want to test the patch I pushed:
> >
> > https://gerrit.ovirt.org/c/ovirt-hosted-engine-ha/+/115596
> I'm happy to give it a try.
> Please confirm that I need to replace this file (network.py) on all my
> nodes (CentOS 8.4 based) which can host my engine.

It definitely makes sense to do so, but in principle there is no problem
with applying it only on some of them. That's especially useful if you try
this first on a test env and try to enforce a reproduction somehow (overload
the network, disconnect stuff, etc.).

> >
> > Other ideas/opinions about how to enhance this part of the monitoring
> > are most welcome.
> >
> > If this phenomenon is new for you, and you can reliably say it's not due to
> > a recent "natural" higher network load, I wonder if it's due to some weird
> > bug/change somewhere.
> I'm quite sure that I see this since we moved to 4.4.(4).
> Just for house keeping I'm running 4.4.7 now.

We use 'dig' as the network monitor since 4.3.5, around one year before 4.4
was released: https://bugzilla.redhat.com/1659052

Which version did you use before 4.4?

Best regards,
-- 
Didi
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/PI23BOXRQSK2HTJWIOT2RTFUJFK7LXFT/


[ovirt-users] Re: ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 1600 due to network status

2021-07-19 Thread Christoph Timm

Hi Didi,

thank you for the quick response.


On 19.07.21 at 07:59, Yedidyah Bar David wrote:

On Mon, Jul 19, 2021 at 8:39 AM Christoph Timm  wrote:

Hi List,

I'm trying to understand why my hosted engine is moved from one node to
another from time to time.
It is happening sometimes multiple times a day. But there are also days
without it.

I can see the following in the ovirt-hosted-engine-ha/agent.log:
ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
Penalizing score by 1600 due to network status

After that the engine will be shut down and started on another host.
The oVirt Admin portal is showing the following around the same time:
Invalid status on Data Center Default. Setting status to Non Responsive.

But the whole cluster is working normally during that time.

I believe that I have somehow a network issue on my side but I have no
clue what kind of check is causing the network status to be penalized.

Does anyone have an idea how to investigate this further?

Please check also broker.log. Do you see 'dig' failures?

Yes I found them as well.

Thread-1::WARNING::2021-07-19 08:02:00,032::network::120::network.Network::(_dns) DNS query failed:
; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
;; global options: +cmd
;; connection timed out; no servers could be reached
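
To see how often these warnings occur, a small Python sketch along the following lines could count them per hour. It assumes that each warning is a single line in the actual broker.log, with a space between date and time (the mail wraps it onto two lines); the function name is hypothetical and the second timestamp in the sample is made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable

# Matches lines such as:
# Thread-1::WARNING::2021-07-19 08:02:00,032::network::...::(_dns) DNS query failed:
LINE_RE = re.compile(
    r"^(?P<thread>[^:]+)::(?P<level>[A-Z]+)::"
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<hour>\d{2}):\d{2}:\d{2},\d+::"
    r".*DNS query failed"
)


def dns_failures_per_hour(lines: Iterable[str]) -> Counter:
    """Count 'DNS query failed' log lines, grouped by date and hour."""
    counts: Counter = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[f"{match['date']} {match['hour']}:00"] += 1
    return counts


# First line is from the log excerpt above; the last one is invented.
SAMPLE_LINES = [
    "Thread-1::WARNING::2021-07-19 08:02:00,032::network::120::"
    "network.Network::(_dns) DNS query failed:",
    ";; connection timed out; no servers could be reached",
    "Thread-1::WARNING::2021-07-19 08:40:10,001::network::120::"
    "network.Network::(_dns) DNS query failed:",
]

if __name__ == "__main__":
    for hour, count in sorted(dns_failures_per_hour(SAMPLE_LINES).items()):
        print(hour, count)
```

Passing an open file object for broker.log instead of SAMPLE_LINES works the same way, since the function only iterates over lines.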



This happened several times already on our CI infrastructure, but yours is
the first report from an actual real user. See also:

https://lists.ovirt.org/archives/list/in...@ovirt.org/thread/LIGS5WXGEKWACY5GCK7Z6Q2JYVWJ6JBF/
So I understand that the following command is triggered to test the 
network: "dig +tries=1 +time=5"


I didn't open a bug for this (yet?), also because I never reproduced on my
own machines and am not sure about the exact failing flow. If this is
reproducible reliably for you, you might want to test the patch I pushed:

https://gerrit.ovirt.org/c/ovirt-hosted-engine-ha/+/115596

I'm happy to give it a try.
Please confirm that I need to replace this file (network.py) on all my 
nodes (CentOS 8.4 based) which can host my engine.


Other ideas/opinions about how to enhance this part of the monitoring
are most welcome.

If this phenomenon is new for you, and you can reliably say it's not due to
a recent "natural" higher network load, I wonder if it's due to some weird
bug/change somewhere.

I'm quite sure that I see this since we moved to 4.4.(4).
Just for house keeping I'm running 4.4.7 now.


Thanks and best regards,

List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/RBBILRNRT57YNREOKAYWWZFCJE5ACZRY/


[ovirt-users] Re: ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 1600 due to network status

2021-07-18 Thread Yedidyah Bar David
On Mon, Jul 19, 2021 at 8:39 AM Christoph Timm  wrote:
>
> Hi List,
>
> I'm trying to understand why my hosted engine is moved from one node to
> another from time to time.
> It is happening sometimes multiple times a day. But there are also days
> without it.
>
> I can see the following in the ovirt-hosted-engine-ha/agent.log:
> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
> Penalizing score by 1600 due to network status
>
> After that the engine will be shut down and started on another host.
> The oVirt Admin portal is showing the following around the same time:
> Invalid status on Data Center Default. Setting status to Non Responsive.
>
> But the whole cluster is working normally during that time.
>
> I believe that I have somehow a network issue on my side but I have no
> clue what kind of check is causing the network status to be penalized.
>
> Does anyone have an idea how to investigate this further?

Please check also broker.log. Do you see 'dig' failures?
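
To line up the agent.log penalties with anything found in broker.log, a minimal Python sketch like the one below can pull the penalty amount and reason out of agent.log lines. It relies only on the message text quoted above and makes no assumption about the rest of the line; the function name is hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# Matches the message text, wherever it appears in the line.
PENALTY_RE = re.compile(r"Penalizing score by (\d+) due to (.+?)\s*$")


def find_penalties(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (amount, reason) for every score penalty found in the lines."""
    found: List[Tuple[int, str]] = []
    for line in lines:
        match = PENALTY_RE.search(line)
        if match:
            found.append((int(match.group(1)), match.group(2)))
    return found


# Message text as quoted from agent.log in the report above.
SAMPLE_LINE = (
    "ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) "
    "Penalizing score by 1600 due to network status"
)

if __name__ == "__main__":
    print(find_penalties([SAMPLE_LINE]))
```

Grouping the results by reason would show whether the network check is the only thing lowering the score on a given host.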

This happened several times already on our CI infrastructure, but yours is
the first report from an actual real user. See also:

https://lists.ovirt.org/archives/list/in...@ovirt.org/thread/LIGS5WXGEKWACY5GCK7Z6Q2JYVWJ6JBF/

I didn't open a bug for this (yet?), also because I never reproduced on my
own machines and am not sure about the exact failing flow. If this is
reproducible reliably for you, you might want to test the patch I pushed:

https://gerrit.ovirt.org/c/ovirt-hosted-engine-ha/+/115596

Other ideas/opinions about how to enhance this part of the monitoring
are most welcome.

If this phenomenon is new for you, and you can reliably say it's not due to
a recent "natural" higher network load, I wonder if it's due to some weird
bug/change somewhere.

Thanks and best regards,
-- 
Didi
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/5F3I646BN3SFT6QJNGYFXMO27ZPRMJZI/