It's important to understand the oVirt design philosophy. It may be somewhat understated in the documentation, because I'm afraid they copied it from VMware's vSphere, which might have copied it from Nutanix, which might have copied it from who knows whom else... which might explain why they are a little shy about it.
The basic truth is: HA is a chicken-and-egg issue. Having several management engines won't get you HA, because in a case of conflict these engines can't easily decide who is boss. Which is why oVirt (and vSphere/Nutanix, most likely) will concede defeat (or death) on every startup.

What that mostly means is that you don't really need to ensure that a smart and highly complex management machine, one that can juggle dozens of infrastructure pieces against an optimal plan of operations, is in fact at all times highly available. It's quite enough to have the infrastructure pieces ensure that the last plan this ME produced is faithfully executed.

So oVirt has a super-intelligent management engine build a plan. That plan is written to super-primitive but reliable storage. All hosts will faithfully (and without personal ambitions to improve it) execute that last plan, which includes launching the management engine... And that single newly started management engine can read the basic infrastructure data, as well as the latest plan, to hopefully create a better new plan before it dies...

And that's why, unless your ME always dies before a new plan can be created, you don't need HA for the ME: it's sufficient to have a good-enough plan for all hosts. Like far too many clusters, oVirt relegates HA to a passive storage device that is always highly available. With SANs and NFS filers, that's hopefully solved in hardware; with HCI Gluster, it's done with majority voting, hopefully.

All that said... I've rarely had all 3 nodes register just perfectly in a 3-node oVirt HCI cluster. I have no idea why that is the case in both 4.3 and 4.4. I have almost always had to add the two additional nodes via 'add host' to make them available both as compute nodes and as Gluster peers. On the other hand, it just works, and doing it twice or a hundred times won't break a thing.
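To make the pattern concrete, here is a minimal sketch (not oVirt's actual code; all names and the plan format are hypothetical) of the "plan on dumb reliable storage" idea: one engine atomically persists its latest plan, and stateless hosts simply execute whatever plan is there, including the step that relaunches the engine itself.

```python
# Illustrative sketch of the pattern described above, not oVirt internals.
# A single engine writes its plan to primitive-but-durable storage; hosts
# read and execute the last plan without any coordination among themselves.
import json
from pathlib import Path

PLAN_FILE = Path("/tmp/cluster-plan.json")  # stands in for SAN/NFS/Gluster

def engine_write_plan(assignments: dict) -> None:
    """The management engine persists its plan atomically, then may die."""
    tmp = PLAN_FILE.with_suffix(".tmp")
    tmp.write_text(json.dumps(assignments))
    tmp.replace(PLAN_FILE)  # atomic rename: hosts never see a half-written plan

def host_execute_plan(host: str) -> list:
    """Each host reads the last plan and runs only its own share of it."""
    plan = json.loads(PLAN_FILE.read_text())
    return plan.get(host, [])  # e.g. ["HostedEngine", "vm-db"]

# The engine plans and may then die; any surviving host still knows what
# to run, including relaunching the engine so a new plan can be made.
engine_write_plan({"host1": ["HostedEngine", "vm-db"], "host2": ["vm-web"]})
print(host_execute_plan("host1"))
```

The only "HA" ingredient here is the storage itself, which is exactly the point of the design: the clever part (planning) is allowed to die, and the durable part (the plan) is kept trivially simple.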
And that is true for almost every component of oVirt: practically all services can fail, or be restarted at any time, without causing a major disruption or outright failure. That's where I can't help but admire this fail-safe approach (which, somewhat unfortunately, might not have been invented at Red Hat, even if Moshe Bar most likely had a hand in it).

It never hurts to make sure you add those extra nodes with the ability to run the management engine, either, but that's also something you can always add later to any host (it just takes Ansible patience to do so). Today I just consider that one of dozens, if not hundreds, of quirks of oVirt that I find 'amazing' in a product also sold commercially.

_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/AYAKEZMSUSPN5GIDA7HELHVRU4GFY36F/