Hello,

On Mon, Feb 7, 2022 at 2:25 PM Yedidyah Bar David <d...@redhat.com> wrote:

> On Mon, Feb 7, 2022 at 1:27 PM Gilboa Davara <gilb...@gmail.com> wrote:
> >
> > Hello,
> >
> > On Mon, Feb 7, 2022 at 8:45 AM Yedidyah Bar David <d...@redhat.com>
> wrote:
> >>
> >> On Sun, Feb 6, 2022 at 5:09 PM Gilboa Davara <gilb...@gmail.com> wrote:
> >> >
> >> > Unlike my predecessor, I not only lost my vmengine, I also lost the
> vdsm services on all hosts.
> >> > All seem to be hitting the same issue - read, the certs under
> /etc/pki/vdsm/certs and /etc/pki/ovirt* all expired a couple of days ago.
> >> > As such, the hosted engine cannot go into global maintenance mode,
> >>
> >> What do you mean by that? What happens if you 'hosted-engine
> >> --set-maintenance --mode=global'?
> >
> >
> > Failed, stating the cluster is not in global maintenance mode.
>
> Please clarify, and/or share relevant logs, if you have them.
>

Sadly enough, no.
When I zapped the old vmengine and the hosts' configuration, I forgot to save
the logs.
(In my defense, it was 4am...)

That said, the fix proposed in BZ#1700460 (Let the user skip the global
maintenance check) might have saved my cluster.


>
> You had a semi-working existing HE cluster.
> You ran engine-backup on it, took a backup, while it was _not_ in
> global maintenance.
>

It was rather odd.
One of the hosts was still active and running the HE engine.
After I updated the apache certs, I could connect to the WebUI, but the
WebUI failed to access the nodes, spewing SSL handshake errors.
I then proceeded to replace the hosts' certs, which seemed to work (e.g.
vdsm-client Host getCapabilities worked). hosted-engine --vm-status worked
and I could see all 3 hosts, but the engine failed to communicate with the
hosts. Hence, even though I had a working cluster and engine, and I could
get the cluster into global maintenance mode, engine-setup --offline
continued to spew "not-in-global-maintenance-mode" errors.
At this stage I decided to simply zap the hosted engine and
ovirt-hosted-engine-cleanup the hosts.

As my brain was half dead, I decided to do a fresh deployment, and not use
the daily backup.



> That's ok and expected.
>
> Then you took one of the hosts and evacuated it (or just a new one),
> (re)installed the OS (or somehow cleaned it up), and ran
> 'hosted-engine --deploy --import-from-file' with the backup you took.
> This failed? Where exactly and with what error?
>

Didn't use the backup.
A clean hosted-engine --deploy failed due to a qemu-6.1 failure (I believe
it's a known BZ).
Once I remembered to downgrade it to 6.0, everything worked as advertised
(minus one export domain, see another email).


>
> If it's the engine-setup running inside the engine VM, with the same
> error as when running 'engine-setup' (perhaps with --offline) manually,
> then this shouldn't happen at this point:
> - engine-backup --mode=restore sets vdc option in the db 'DbJustRestored'
> - engine-setup checks this and sets its own env[JUST_RESTORED] accordingly
>
> > (Understandable, given two of 3 hosts were offline due to certificate
> issues...)
> >
> >
> >>
> >>
> >> > preventing engine-setup --offline from running.
> >>
> >> Actually just a few days ago I pushed a patch for:
> >>
> >> https://bugzilla.redhat.com/show_bug.cgi?id=1700460
> >>
> >> But:
> >>
> >> If you really have a problem that you can't set global maintenance,
> >> using this is a risk - HA might intervene in the middle and shutdown
> >> the VM. So either make sure global maintenance does work, or stop
> >> all HA services on all hosts.
> >>
> >> > Two questions:
> >> > 1. Is there any automated method to renew the vdsm certificates?
> >>
> >> You mean, without an engine?
> >>
> >> I think that if you have a functional engine one way or another,
> >> you can automate this somehow, but I didn't check. Try checking e.g.
> >> the python sdk examples - there might be something there you can
> >> base this on.
> >>
> >> > 2. Assuming the previous answer is "no", assuming I'm somewhat versed
> in using openssl, how can I manually renew them?
> >>
> >> I'd rather not try to invent from memory how this is supposed to work,
> >> and doing this methodically and verifying before replying is quite
> >> an effort.
> >>
> >> If this is really what you want, I suggest something like:
> >>
> >> 1. Set up a test env with an engine and one host
> >> 2. Backup (or use git on) /etc on both
> >> 3. Renew the host cert from the UI
> >> 4. Check what changed
> >>
> >> You should find, IMO, that the key(s) on the host didn't
> >> change. I guess you might also find CSRs on one or both of them.
> >> So basically it should be something like:
> >> 1. Create a CSR on the host for the existing key (one or more,
> >> not sure).
> >> 2. Copy and sign this on the engine using pki-enroll-request.sh
> >> (I think you can find examples for it scattered around, perhaps
> >> even in the main guides)
> >> 3. Copy back the generated certs to the host
> >> 4. Perhaps restart one or more services there (vdsm, imageio?,
> >> ovn, etc.)
> >>
> >> You can check the code in
> >> /usr/share/ovirt-engine/ansible-runner-service-project/project
> >> to see how it's done when initiated from the UI.
> >>
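
A minimal sketch of step 1 of the list above (creating a CSR on the host
for the existing key), for anyone finding this thread later. It only builds
and optionally runs a plain "openssl req" command. The key path, CSR path and
subject string are hypothetical placeholders that must be supplied by the
caller; the values vdsm actually expects are not confirmed anywhere in this
thread. Signing on the engine (step 2) is deliberately left out.

```python
import subprocess
import sys


def csr_command(key_path, csr_path, subject):
    """Build an 'openssl req' command that creates a CSR for an existing key.

    All three arguments are placeholders chosen by the caller; nothing here
    knows about oVirt-specific paths or subject formats.
    """
    return [
        "openssl", "req",
        "-new",              # generate a new certificate request
        "-key", key_path,    # reuse the existing private key
        "-out", csr_path,    # where to write the CSR
        "-subj", subject,    # subject DN, e.g. "/O=example/CN=host.example"
    ]


if __name__ == "__main__" and len(sys.argv) == 4:
    # Usage: python make_csr.py <existing-key.pem> <out.csr> <subject>
    subprocess.run(csr_command(*sys.argv[1:4]), check=True)
```

Comparing the resulting CSR against what a UI-initiated renewal produces in
a test env (as suggested above) would be the way to validate the subject.
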
> >> Good luck and best regards,
> >
> >
> > I more or less found a document stating the above somewhere in the
> middle of the night.
> > Tried it.
> > Got the WebUI working again.
> > However, for the life of me I couldn't get the hosts to talk to
> the engine. (Even though I could use openssl s_client -showcerts -connect
> host and got valid certs).
> > In the end, at around 4am, I decided to take the brute force route, clean
> the hosts, upgrade them to -streams, and redeploy the engine again (3rd
> attempt, after a sufficient amount of coffee reminded me that qemu-6.1 is
> broken and needed to be downgraded before trying to deploy the HE...).
> > Either way, when I finish importing the VMs, I'll open an RFE to add
> BIG-WARNING-IN-BOLD-LETTERS in the WebUI to notify the admin that the
> certificates are about to expire.
>
> You should have already received them, no?
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1258585
>
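
Independent of whatever warnings the engine raises, expiry can also be
checked from the host side. Below is a minimal sketch, assuming the openssl
CLI is installed and that the PEM files to check (for example those under
/etc/pki/vdsm/certs) are passed on the command line. It parses the
"notAfter=" line printed by "openssl x509 -noout -enddate" and reports the
days remaining. The function names are made up for illustration.

```python
import ssl
import subprocess
import sys
from datetime import datetime, timezone


def parse_not_after(openssl_line):
    """Parse a line such as 'notAfter=Feb  5 12:00:00 2022 GMT'."""
    _, _, value = openssl_line.strip().partition("=")
    # ssl.cert_time_to_seconds understands this certificate time format.
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(value),
                                  tz=timezone.utc)


def days_left(not_after, now):
    """Whole days until expiry; negative once the cert has expired."""
    return (not_after - now).days


def cert_not_after(path):
    """Ask the openssl CLI for the expiry date of a PEM certificate."""
    out = subprocess.run(
        ["openssl", "x509", "-noout", "-enddate", "-in", path],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_not_after(out)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for path in sys.argv[1:]:
        print(path, days_left(cert_not_after(path), now), "days left")
```

Run from cron with a threshold check, something like this could mail the
admin well before the certs lapse.
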
> Best regards,
> --
> Didi
>
>
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/MSFGRKYTMSLK43L75DPUQYE5B3N2WNFR/
