[ovirt-users] Re: Random hosts disconnects

2020-09-27 Thread Anton Louw via Users
Hi Artur,

Apologies for the late response, and thanks for the information.

Yes, the two DCs are located in two different physical locations. I will ask our 
data team if there is anything that they can see. This is a very strange issue. 
Someone else logged a similar issue, but no solution has been posted. The 
subject of that email is: “VDSM HOST ISSUE - Message timeout which can be 
caused by communication issues”.


Anton Louw
Cloud Engineer: Storage and Virtualization
__
D: 087 805 1572 | M: N/A
A: Rutherford Estate, 1 Scott Street, Waverley, Johannesburg
anton.l...@voxtelecom.co.za

www.vox.co.za




[ovirt-users] Re: Random hosts disconnects

2020-09-18 Thread Artur Socha
On Fri, Sep 18, 2020 at 1:54 PM Anton Louw 
wrote:

>
>
> Hi Artur,
>
>
>
> Thanks for the reply. I have attached the system logs. There was a
> disconnect at 10:54, but no error that is different to the rest. I do see a
> whole lot of QEMU Guest Agent and block_io errors in the system logs. Not
> entirely sure what this means.
>

After a very quick search on the internet, the first one does not seem to be
severe at all. The QEMU guest agent runs inside the VM and only reports guest
information to the host, so this error typically means the agent is not
installed or not running in that guest:
Sep 18 10:50:41 node05.kvm.voxvm.co.za libvirtd[23603]: 2020-09-18
08:50:41.493+: 23729: error : qemuDomainAgentAvailable:9133 : Guest agent is
not responding: QEMU guest agent is not connected

The second one is completely unknown to me:
Sep 18 10:50:52 node05.kvm.voxvm.co.za libvirtd[23603]: 2020-09-18
08:50:52.802+: 23729: error : qemuMonitorJSONBlockIoThrottleInfo:5005 :
internal error: block_io_throttle inserted entry was not in expected format
Perhaps someone with more libvirt/qemu background will comment on that.


>
> Checking the vdsm logs at the time of the error, the only entry is the
> below:
>
>
>
> “2020-09-18 10:55:57,081+ WARN  (qgapoller/2)
> [virt.periodic.VmDispatcher] could not run  at
> 0x7f2170395578> on ['d3838612-70bb-4731-a0d4-8f65d31b40a6',
> '59a2f394-48fe-4bd9-91d6-08115f2eec0a',
> 'f81e3ab8-c1a9-4674-b238-7e229fd43e7c',
> '42189fa1-4381-02c7-d830-20eac408da2c',
> '423f1c57-f98e-707f-c0f9-d4958d3f0fec',
> '64d1eabc-20ff-4288-98ff-dcfd120fe7d2',
> '4218baf0-e2a1-42c7-2efd-077407f47b4d',
> '42184650-5a60-5403-d758-840bdbf92dd8',
> '492ea3fe-0a27-4dde-abf9-7d146ee1b988',
> '4218df00-15cd-bdf9-efd9-c5ead49fd89c',
> '9c373379-718b-4906-abc1-960fb1820c2d',
> 'b9441c7a-0bfd-4d41-a8de-ee24e4259b36',
> 'd810325a-1a45-4054-a870-c8c052a22354',
> '42189d3f-4570-45ea-6e5a-94c85a5885a1'] (periodic:289)”
>
>
This WARN does not seem to be the cause. It may rather be a result: the VMs
failed to be dispatched (perhaps due to a lack of suitable hosts, which were
disconnected at that moment).


>
> I am stumped. Do you think it is worth a shot increasing the 
> vdsConnectionTimeout
> and vdsHeartbeatInSeconds to 40 for testing purposes?
>

I still don't think it will change anything, unless the network between
those two DCs is a 'TCP over pigeons' kind of setup :)
Now, more seriously: even if increasing the timeouts fixed the connectivity,
I suspect the core issue would still remain. In the best case scenario
it would only be postponed a bit.

Am I correct in assuming those two DCs are located in two different physical
locations? If so, I would closely check the network itself first
(including hardware like routers/switches).
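
One rough way to check the path between the engine and a host is to open TCP connections repeatedly and watch for failures or slow connects. The sketch below is illustrative only: the host name in the usage comment is a placeholder, and port 54321 is assumed to be the VDSM port (the usual default), so adjust both to the actual setup. It measures only TCP connection setup, not the engine's own heartbeat.

```python
import socket
import time


def tcp_connect_time(host, port, timeout=5.0):
    """Return the seconds needed to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return None


def probe(host, port, attempts=10, interval=1.0, timeout=5.0):
    """Probe host:port several times and summarise failures and latency."""
    samples = []
    failures = 0
    for i in range(attempts):
        elapsed = tcp_connect_time(host, port, timeout)
        if elapsed is None:
            failures += 1
        else:
            samples.append(elapsed)
        if i < attempts - 1:
            time.sleep(interval)
    return {
        "attempts": attempts,
        "failures": failures,
        "max_seconds": max(samples) if samples else None,
    }


# Usage sketch (hypothetical host name, assumed VDSM port):
# print(probe("node05.example.com", 54321, attempts=60))
```

Running this from the engine machine against a host in each DC over a longer period would show whether failed or slow connects line up with the "Not Responding" events.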

Artur

>
>
> Thanks
>
>
>

[ovirt-users] Re: Random hosts disconnects

2020-09-18 Thread Anton Louw via Users
Hi Artur,

Thanks for the reply. I have attached the system logs. There was a disconnect 
at 10:54, but no error that is different to the rest. I do see a whole lot of 
QEMU Guest Agent and block_io errors in the system logs. Not entirely sure what 
this means.

Checking the vdsm logs at the time of the error, the only entry is the below:

“2020-09-18 10:55:57,081+ WARN  (qgapoller/2) [virt.periodic.VmDispatcher] 
could not run  at 0x7f2170395578> on 
['d3838612-70bb-4731-a0d4-8f65d31b40a6', 
'59a2f394-48fe-4bd9-91d6-08115f2eec0a', 'f81e3ab8-c1a9-4674-b238-7e229fd43e7c', 
'42189fa1-4381-02c7-d830-20eac408da2c', '423f1c57-f98e-707f-c0f9-d4958d3f0fec', 
'64d1eabc-20ff-4288-98ff-dcfd120fe7d2', '4218baf0-e2a1-42c7-2efd-077407f47b4d', 
'42184650-5a60-5403-d758-840bdbf92dd8', '492ea3fe-0a27-4dde-abf9-7d146ee1b988', 
'4218df00-15cd-bdf9-efd9-c5ead49fd89c', '9c373379-718b-4906-abc1-960fb1820c2d', 
'b9441c7a-0bfd-4d41-a8de-ee24e4259b36', 'd810325a-1a45-4054-a870-c8c052a22354', 
'42189d3f-4570-45ea-6e5a-94c85a5885a1'] (periodic:289)”
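
To see which VMs such a warning refers to, and whether the same VMs show up each time, the VM IDs can be pulled out of the log line. This is a minimal sketch, assuming the warning keeps the format quoted above; the sample line is a shortened copy with only two of the IDs.

```python
import re

UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def vm_ids_in_warning(line):
    """Return the VM UUIDs listed in a VmDispatcher 'could not run' warning."""
    if "virt.periodic.VmDispatcher" not in line or "could not run" not in line:
        return []
    return UUID_RE.findall(line)


# Shortened copy of the warning quoted above (only two of the VM IDs).
sample = (
    "2020-09-18 10:55:57,081+ WARN  (qgapoller/2) [virt.periodic.VmDispatcher] "
    "could not run  at 0x7f2170395578> on "
    "['d3838612-70bb-4731-a0d4-8f65d31b40a6', "
    "'59a2f394-48fe-4bd9-91d6-08115f2eec0a'] (periodic:289)"
)
print(vm_ids_in_warning(sample))
```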

I am stumped. Do you think it is worth a shot increasing the 
vdsConnectionTimeout and vdsHeartbeatInSeconds to 40 for testing purposes?

Thanks


Anton Louw

[ovirt-users] Re: Random hosts disconnects

2020-09-18 Thread Artur Socha
Hi Anton,
I am not sure if changing this value would fix the issue. The defaults are
already pretty high, for example vdsHeartbeatInSeconds=30 seconds,
vdsTimeout=180 seconds and vdsConnectionTimeout=20 seconds.

Do you still have the relevant logs from the affected hosts?
 /var/log/vdsm/vdsm.log
 /var/log/vdsm/supervdsm.log
Please look for any jsonrpc errors, i.e. write/read errors or (connection)
timeouts. Storage-related warnings/errors might also be relevant.

Plus system logs if possible:
journalctl -f /usr/share/vdsm/vdsmd
journalctl -f /usr/sbin/libvirtd

To get system logs from a particular time period, please combine these with
the -S and -U options, as in the following example:

journalctl -S "2020-01-12 07:00:00" -U "2020-01-12 07:15:00"
I am not sure what to look for there, besides any warnings/errors or
anything else that seems unusual.
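
The vdsm logs are plain files rather than journal entries, so a similar time-window filter has to be done by hand. The following Python sketch is one way to do it. It assumes each log entry starts with a timestamp of the form `2020-09-18 10:55:57,081` followed by the level, and the keyword list is only a starting guess, not an official list of vdsm messages.

```python
from datetime import datetime

# Assumed keywords; extend with whatever looks suspicious in your logs.
KEYWORDS = ("timeout", "timed out", "connection")


def parse_timestamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS' of a log line, or return None."""
    try:
        return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None


def suspicious_lines(lines, since, until, levels=("WARN", "ERROR")):
    """Yield lines in [since, until] that carry a listed level or keyword.

    Lines without a leading timestamp (for example traceback continuation
    lines) are skipped.
    """
    for line in lines:
        stamp = parse_timestamp(line)
        if stamp is None or not (since <= stamp <= until):
            continue
        lowered = line.lower()
        has_level = any(f" {level} " in line for level in levels)
        has_keyword = any(word in lowered for word in KEYWORDS)
        if has_level or has_keyword:
            yield line.rstrip("\n")


# Usage sketch:
# with open("/var/log/vdsm/vdsm.log") as log:
#     window = (datetime(2020, 9, 18, 10, 50), datetime(2020, 9, 18, 11, 0))
#     for hit in suspicious_lines(log, *window):
#         print(hit)
```

The timestamps are compared as local time on the host, so the window should be given in the host's time zone.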

Artur



-- 
Artur Socha
Senior Software Engineer, RHV
Red Hat

[ovirt-users] Re: Random hosts disconnects

2020-09-17 Thread Anton Louw via Users
Hi Everybody,


I did some digging around and saw a few things regarding “vdsHeartbeatInSeconds”.

I had a look at the properties file located at 
/etc/ovirt-engine/engine-config/engine-config.properties, and do not see an 
entry for “vdsHeartbeatInSeconds.type=Integer”.

Seeing as these data centers are geographically split, could the 
“vdsHeartbeatInSeconds” potentially be the issue? Is it safe to increase this 
value after I add “vdsHeartbeatInSeconds.type=Integer” into my 
engine-config.properties file?
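
Whether a given `.type` entry is already present can be checked with a few lines of Python instead of reading the file by eye. This is only a sketch with a deliberately simple parser (it ignores the line continuations and escapes that the Java properties format allows); the file path and option name are the ones mentioned above. It only reports whether the entry exists and says nothing about whether changing the value is safe.

```python
def load_properties(text):
    """Parse simple 'key=value' lines, skipping blanks and '#'/'!' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def has_type_entry(props, option):
    """Return True if '<option>.type' is defined in the parsed properties."""
    return f"{option}.type" in props


# Usage sketch (path and option name as mentioned above):
# with open("/etc/ovirt-engine/engine-config/engine-config.properties") as f:
#     props = load_properties(f.read())
# print(has_type_entry(props, "vdsHeartbeatInSeconds"))
```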



Thanks


Anton Louw



From: Anton Louw via Users 
Sent: 16 September 2020 09:01
To: users@ovirt.org
Subject: [ovirt-users] Random hosts disconnects


Hi All,

I have a strange issue in my oVirt environment. I currently have a standalone 
manager which is running in VMware. In my oVirt environment, I have two Data 
Centers. The manager is currently sitting on the same subnet as DC1. Randomly, 
hosts in DC2 will say “Not Responding” and then 2 seconds later, the hosts will 
activate again.

The strange thing is that when the manager was sitting on the same subnet as DC2, 
hosts in DC1 would randomly say “Not Responding”.

I have tried going through the logs, but I cannot see anything out of the 
ordinary regarding why the hosts would drop connection. I have attached the 
engine.log for anybody that would like to do a spot check.

Thanks

Anton Louw

Disclaimer

The contents of this email are confidential to the sender and the intended 
recipient. Unless the contents are clearly and entirely of a personal nature, 
they are subject to copyright in favour of the holding company of the Vox group 
of companies. Any recipient who receives this email in error should immediately 
report the error to the sender and permanently delete this email from all 
storage devices.

This email has been scanned for viruses and malware, and may have been 
automatically archived by Mimecast Ltd, an innovator in Software as a Service 
(SaaS) for business. Providing a safer and more useful place for your human 
generated data. Specializing in; Security, archiving and compliance. To find 
out more Click 
Here.



___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/EJL246IPBGEHIQ5KUWG2APSTQWFE7VFK/