[ovirt-users] Re: Non-storage nodes erroneously included in quota calculations for HCI?

2020-07-29 Thread thomas
Sorry, Strahil, for not getting back to you on this...

I dare not repeat the exercise, because I have no idea whether I'd get out of
such a complete breakdown cleanly again.

Since I don't have duplicate physical infrastructure to just test this 
behavior, I was going to use a big machine to run a test farm nested.

I spent about a week trying to get nesting to work, but it ultimately failed
to run the overlay network of the hosted engine properly (see my separate post
elsewhere on this list).
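
For reference, enabling nested KVM on the outer host is the easy part; a
minimal sketch, assuming an Intel CPU (AMD uses kvm_amd instead):

echo 'options kvm_intel nested=1' > /etc/modprobe.d/kvm-nested.conf
modprobe -r kvm_intel && modprobe kvm_intel
cat /sys/module/kvm_intel/parameters/nested   # should print Y (or 1)

The failure described above happened beyond that point, in the hosted engine's
overlay network.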

And then I read in a response to a post way back here that oVirt nested on
oVirt isn't just "not supported" but is known (although not documented or
advertised) not to work at all.

So there went the chance to reproduce the issue...

What I find striking is that the 'original' oVirt or RHV, from pre-Gluster-HCI
days, seems to support the notion of shutting down compute nodes when there
isn't enough workload to fill them, in order to save energy. In an HCI
environment that obviously doesn't play well with the Gluster storage nodes,
but pure compute nodes should still support cold standby.

Can't find any documentation on this, though.
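
The closest mechanism seems to be the cluster scheduling policy: oVirt ships a
'power_saving' policy that consolidates VMs onto fewer hosts and, combined
with power management, can power idle hosts down. A minimal sketch for
checking which policies an engine exposes over the REST API; the engine URL
and credentials below are placeholders, not from this thread:

curl -sk -u 'admin@internal:PASSWORD' -H 'Accept: application/xml' \
  'https://engine.example.com/ovirt-engine/api/schedulingpolicies' \
  | grep -io 'power[a-z_]*' | sort -u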


[ovirt-users] Re: Non-storage nodes erroneously included in quota calculations for HCI?

2020-05-21 Thread Strahil Nikolov via Users
On May 21, 2020 12:29:24 PM GMT+03:00, Strahil Nikolov via Users wrote:
>[...]
>gluster volume list
>gluster pool list
>for i in $(gluster volume list); do gluster volume status $i;echo;
>gluster volume status $i; echo;echo;echo; done
>[...]

Yeah...
The for loop should use 'status' and 'info', so it should be something like:

gluster volume list
gluster pool list
for i in $(gluster volume list); do gluster volume status $i; echo; gluster
volume info $i; echo; echo; echo; done
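
For context: on the setup described here (three HCI nodes plus three
compute-only peers), 'gluster pool list' should show six connected peers even
though only three contribute bricks, which is exactly what matters if quorum
is counted per peer. Hostnames and UUIDs below are illustrative only:

UUID                                    Hostname                State
<uuid-1>                                hci1.example.com        Connected
<uuid-2>                                hci2.example.com        Connected
<uuid-3>                                hci3.example.com        Connected
<uuid-4>                                compute1.example.com    Connected
<uuid-5>                                compute2.example.com    Connected
<uuid-6>                                localhost               Connected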

Best Regards,
Strahil Nikolov


[ovirt-users] Re: Non-storage nodes erroneously included in quota calculations for HCI?

2020-05-21 Thread Strahil Nikolov via Users
On May 20, 2020 5:12:05 PM GMT+03:00, tho...@hoberg.net wrote:
>[...]
>Seems glusterd computes quorum based on total peers (6) not on
>redundancy (2+1).
>[...]

Hi Thomas,

Quite strange.
Get to one of the Gluster TSP (trusted storage pool) nodes and provide some data:

gluster volume list
gluster pool list
for i in $(gluster volume list); do gluster volume status $i; echo; gluster
volume status $i; echo; echo; echo; done

Best Regards,
Strahil Nikolov


[ovirt-users] Re: Non-storage nodes erroneously included in quota calculations for HCI?

2020-05-20 Thread thomas
OK ;-)

3-node HCI, 2+1 data/arbiter.
Added 3 compute-only nodes via host install without HE support; they add no
storage to the Gluster (the install still adds them as peers).

With 2 compute-only nodes inactive/down, I updated the third compute node (no
contributing bricks) and saw all VMs pausing and glusterd on the HCI nodes
"lost quorum on brick engine/vmstore/data" when it rebooted to activate the
new kernel.

Had to launch an additional compute-only node to let glusterd on the HCI nodes
recover quorum.
It seems glusterd computes quorum based on total peers (6), not on redundancy
(2+1).

With the gluster volumes down, running VMs remain paused according to virsh,
HE and the UI aren't there, and hosted-engine --vm-status reports "not
retrieved from storage".


[ovirt-users] Re: Non-storage nodes erroneously included in quota calculations for HCI?

2020-05-19 Thread Strahil Nikolov via Users
On May 20, 2020 2:37:32 AM GMT+03:00, tho...@hoberg.net wrote:
>For my home lab I operate a 3-node HCI cluster on 100% passive Atoms,
>mostly to run light infrastructure services such as LDAP and NextCloud.
>
>I then add workstations or even laptops as pure compute hosts to the
>cluster for bigger but temporary things that might actually run a
>different OS most of the time or just be shut off. From oVirt's point
>of view, these are just first put into maintenance and then shut down
>until needed again. No fencing or power management, all manual.
>
>All nodes, even the HCI ones, run CentOS 7 with more of a workstation
>configuration, so updates pile up pretty quickly.
>
>After I recently upgraded one of these extra compute nodes, I found my
>three-node HCI cluster not just faltering, but indeed very hard to
>reactivate at all.
>
>The faltering is a distinct issue: I have the impression that reboots
>of oVirt nodes cause broadcast storms on my rather simplistic 10Gbit L2
>switch, which a normal CentOS instance (or any other OS) doesn't, but
>that's for another post.
>
>Now what struck me was that the gluster daemons on the three HCI nodes
>kept complaining about a lack of quorum long after the network was all
>back to normal, even when all three of them were there, saw each other
>perfectly in "gluster show status all", and were ready without any
>healing issues pending at all.
>Glusterd would complain on all three nodes that there was no quorum for
>the bricks and stop them.
>
>That went away as soon as I started one additional compute node, a node
>that was a gluster peer (because an oVirt host added to an HCI cluster
>always gets put into the Gluster, even if it's not contributing
>storage) but had no bricks. Immediately the gluster daemons on the
>three nodes with contributing bricks would report quorum as met and
>launch the volumes (and thus all the rest of oVirt), even though in
>terms of *storage bricks* nothing had changed.
>
>I am afraid that downing the extra compute-only oVirt node will bring
>down the HCI: clearly not the type of redundancy it's designed to
>deliver.
>
>Evidently such compute-only hosts (and gluster members) get included
>in some quorum deliberations even if they hold not a single brick,
>neither storage nor arbitration.
>
>To me that seems like a bug, if that is indeed what happens: There I
>need your advice and suggestions.
>
>AFAIK HCI is a late addition to oVirt/RHEV, as storage and compute were
>originally designed to be completely distinct. In fact there are still
>remnants of documentation which seem to prohibit using a node for both
>compute and storage... which is what HCI is all about.
>
>And I have seen compute nodes with "matching" storage (parts of a
>distinct HCI setup that was taken down but still had all the storage
>and Gluster elements operable) being happily absorbed into an HCI
>cluster with all Gluster storage appearing in the GUI etc., without any
>manual creation or inclusion of bricks: fully automatic (and
>undocumented)!
>
>In that case it makes sense to widen the scope of quorum calculations
>when additional nodes are hyperconverged elements with contributing
>bricks. It also seems the only way to turn a 3-node HCI into a 6- or
>9-node one.
>
>But if you really just want to add compute nodes without bricks, those
>shouldn't get "quorum votes" when they have no storage playing a role
>in the redundancy.
>
>I can easily imagine the missing "if-then-else" in the code here, but I
>was actually very surprised to see those failure and success messages
>coming from glusterd itself, which to my understanding is pretty
>unrelated to oVirt on top. Not from the management engine (which wasn't
>running anyway), not from VDSM.
>
>Re-creating the scenario is very scary, even if I have gone through
>this three times already while trying to just bring my HCI back up. And
>there are such verbose logs all over the place that I'd like some
>advice on which ones I should post.
>
>But simply speaking: Gluster peers should get no quorum voting rights
>on volumes unless they contribute bricks. That rule seems broken.
>
>Those in the know, please let me know if I am on a wild goose chase or
>if there is a real issue here that deserves a bug report.

I have skipped a huge part of your e-mail because it was too long (don't get
offended).

Can you summarize in one (or two) sentences what exactly the problem is?
Is the UI not detecting the Gluster status, is quota preventing you from
starting VMs, or something else?

Best Regards,
Strahil Nikolov