[ovirt-users] Re: Recovery from power outage

Yedidyah Bar David Thu, 04 Feb 2021 00:22:49 -0800

On Thu, Feb 4, 2021 at 10:09 AM Roderick Mooi <[email protected]> wrote:
>
> Hi Didi!
>
> Ok, I started the clean metadata process and then found the real issue - I 
> had copied the certs (just /etc/pki/vdsm; other pki folders were intact) from 
> a working host (host 2) to host 1 following the re-deploy cleanup as part of 
> the process to get it online again. The problem is the cert contains the 
> hostname (so now the cert on host 1 contains as Subject CN the hostname of 
> host 2).


Right. Sorry I didn't remember that.

> I found some docs on the certs for libvirt but it's not clear what I need to 
> do to correctly re-generate the vdsm certs on host 1. Can you help? PS I 
> presume I need to re-generate client certs for that host as well and copy to 
> the engine?

Easiest is to put the host into maintenance, then use "Enroll Certificate" -
IIRC this should be enough. If you want to be sure, it is perhaps better to
remove all certs/keys and do 'Reinstall' instead, making sure you
choose 'Deploy' for 'Hosted Engine'.
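
To see which name a host's vdsm certificate actually carries, before and
after re-enrolling, something like the following might help. It is only a
sketch: the helper names subject_cn and cert_cn_matches are made up for
illustration, and the certificate path in the usage line is my assumption
of the usual location, so please verify it on your host.

```shell
#!/bin/sh
# Extract the CN from an "openssl x509 -noout -subject" line read on stdin.
# Handles both the older "/O=x/CN=host" and the newer "O = x, CN = host" layouts.
subject_cn() {
    sed -n 's/^subject=.*CN *= *\([^,/]*\).*$/\1/p'
}

# Succeed only if the certificate's Subject CN equals the expected name.
# Usage: cert_cn_matches /path/to/cert.pem expected.host.name
cert_cn_matches() {
    cn=$(openssl x509 -in "$1" -noout -subject | subject_cn)
    [ "$cn" = "$2" ]
}
```

For example (path assumed, not checked on 3.6):
cert_cn_matches /etc/pki/vdsm/certs/vdsmcert.pem "$(hostname -f)" || echo "CN mismatch"
The expected name should be whatever address the engine knows the host by,
which is not necessarily the output of hostname -f.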

Good luck,

>
> Appreciated,
>
> Roderick
>
>
> On 2021/02/03 16:58, Yedidyah Bar David wrote:
> > On Wed, Feb 3, 2021 at 4:52 PM Roderick Mooi <[email protected]> wrote:
> >>
> >> Thanks,
> >>
> >>> I didn't check, but am pretty certain that it's not related to the
> >>> engine db. Do you see such duplicates there as well (using the web ui
> >>> or sql against it)? If so, fix these first. If no other means, put the
> >>> host to maintenance and reinstall with the correct name.
> >>
> >> Not seeing duplicates in the web UI, only in the --vm-status. Can you 
> >> please assist me with the sql commands or reference to the database schema 
> >> + where to check? I'd like to check that first before doing anything too 
> >> drastic.
> >
> > /usr/share/ovirt-engine/dbscripts/engine-psql.sh -c 'select * from vds'
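
If the full 'select *' output is too wide to read, a narrower query plus a
small filter may make duplicates easier to spot. This is a sketch only:
duplicate_ids is a hypothetical helper, and the column names vds_name and
vds_spm_id in the usage line are my assumption about the vds view, so check
them against your schema first.

```shell
#!/bin/sh
# Read "name|id" rows on stdin (psql unaligned output) and print every id
# that appears more than once, one per line, sorted.
duplicate_ids() {
    awk -F'|' 'NF >= 2 { count[$2]++ }
               END { for (id in count) if (count[id] > 1) print id }' | sort
}
```

Possible usage, assuming engine-psql.sh passes extra options through to psql:
/usr/share/ovirt-engine/dbscripts/engine-psql.sh -At -c 'select vds_name, vds_spm_id from vds' | duplicate_ids
No output would mean no id is shared between hosts.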
> >
> >>
> >> Note: it only duplicated the hostname after I changed the host_id, before 
> >> that it had the correct hostname but duplicate host_id.
> >>
> >> PS I have a recent backup of the database (from just before the outage), which I could 
> >> restore if you think that'll do the trick without breaking anything?
> >>
> >>
> >> On 2021/02/03 16:33, Yedidyah Bar David wrote:
> >>> On Wed, Feb 3, 2021 at 4:21 PM Roderick Mooi <[email protected]> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>>> Any idea how this happened?
> >>>>
> >>>> Somehow related to the power being "pulled" at the wrong time?
> >>>>
> >>>>> Perhaps this is a backup done by emacs?
> >>>>
> >>>> Not sure what did it but I'm glad it did ;)
> >>>>
> >>>>> Please compare it to your other hosts. It should be (mostly?)
> >>>>> identical, but make sure that host_id= is unique per host. It should
> >>>>> match the spm host id for this host in the engine database.
> >>>>
> >>>> I had to restore one of my hosts (host 1) manually due to a cleanup during 
> >>>> my re-deploy attempts. I managed to do this successfully by copying the 
> >>>> missing files from another host (host 2) but the first time the host ID 
> >>>> matched one of the other hosts (which made at least hosted-engine 
> >>>> --vm-status unhappy) [I hadn't seen your email yet :(]. I subsequently 
> >>>> corrected the host_id and rebooted the guilty host. Things mostly seem 
> >>>> to be working now except that in hosted-engine --vm-status my first two 
> >>>> hosts (the one I copied the .conf from as well as the one I copied it to 
> >>>> [without changing the ID :O]) now have the same hostname :-/ I'm 
> >>>> assuming there's a mismatch in the engine database - where/how do I fix 
> >>>> that?
> >>>>
> >>>
> >>> I didn't check, but am pretty certain that it's not related to the
> >>> engine db. Do you see such duplicates there as well (using the web ui
> >>> or sql against it)? If so, fix these first. If no other means, put the
> >>> host to maintenance and reinstall with the correct name.
> >>>
> >>> If it's just the shared storage, you can try the following. Carefully.
> >>> Didn't try myself. Try on a test system first.
> >>>
> >>> 1. Set global maintenance
> >>>
> >>> 2. Stop ovirt-ha-agent, ovirt-ha-broker, perhaps also vdsmd, supervdsmd
> >>>
> >>> 3. hosted-engine --clean-metadata --host-id=1
> >>>
> >>> - Perhaps even pass --force-cleanup, not sure when it's needed
> >>>
> >>> - Repeat for other IDs as needed
> >>>
> >>> 4. Start ovirt-ha-agent (I think this should start all the others, but
> >>> make sure)
> >>>
> >>> 5. Wait a bit. I am pretty certain that they should recreate their
> >>> entries in the shared storage and eventually --vm-status should look
> >>> ok.
> >>>
> >>> 6. Exit global maintenance
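
The steps above could be sketched as a small script. Treat it as untested:
run and clean_he_metadata are hypothetical helper names, it assumes a
systemd-based host (on EL6 'service <name> stop|start' would be needed
instead), it leaves vdsmd and supervdsmd alone, and it spells the option
--clean-metadata, which is how I recall it from the hosted-engine help
output. Setting DRY_RUN=1 only prints the commands.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Clean the hosted-engine metadata for the given host IDs.
# Usage: clean_he_metadata 1 2 ...
clean_he_metadata() {
    run hosted-engine --set-maintenance --mode=global || return 1
    run systemctl stop ovirt-ha-agent ovirt-ha-broker || return 1
    for id in "$@"; do
        run hosted-engine --clean-metadata --host-id="$id" || return 1
    done
    run systemctl start ovirt-ha-broker ovirt-ha-agent || return 1
    # Leaving global maintenance is deliberately not automated. Wait a bit,
    # check the result, then exit maintenance by hand:
    #   hosted-engine --vm-status
    #   hosted-engine --set-maintenance --mode=none
}
```

A dry run such as 'DRY_RUN=1; clean_he_metadata 1 2' shows what would be
executed. Stopping at the first failing command is intentional, so that a
failed stop does not lead to cleaning metadata under a running agent.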
> >>>
> >>> Good luck,
> >>>
> >>>> Appreciated! (and happy cos our cluster is almost back to normal :) )
> >>>>
> >>>> On 2021/02/03 11:30, Yedidyah Bar David wrote:
> >>>>> On Wed, Feb 3, 2021 at 11:12 AM Roderick Mooi <[email protected]> wrote:
> >>>>>>
> >>>>>> Hello and thanks for assisting!
> >>>>>>
> >>>>>> I think I may have found the problem :)
> >>>>>>
> >>>>>> /etc/ovirt-hosted-engine/hosted-engine.conf
> >>>>>>
> >>>>>> is blank.
> >>>>>>
> >>>>>> But I do have hosted-engine.conf~
> >>>>>
> >>>>> Any idea how this happened?
> >>>>>
> >>>>> Perhaps this is a backup done by emacs?
> >>>>>
> >>>>>>
> >>>>>> Can I cp this to restore the original?
> >>>>>
> >>>>> Please compare it to your other hosts. It should be (mostly?)
> >>>>> identical, but make sure that host_id= is unique per host. It should
> >>>>> match the spm host id for this host in the engine database.
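
For the comparison, a tiny helper that prints only the host_id line of the
file can be run on each host. conf_host_id is a hypothetical name, and the
sketch assumes the file uses plain key=value lines with no spaces around
the equals sign.

```shell
#!/bin/sh
# Print the value of host_id= from a hosted-engine.conf-style file.
# Usage: conf_host_id /path/to/hosted-engine.conf
conf_host_id() {
    sed -n 's/^host_id=\(.*\)$/\1/p' "$1"
}
```

For example, on every host:
conf_host_id /etc/ovirt-hosted-engine/hosted-engine.conf
Each host should print a different number. Running it on the
hosted-engine.conf~ copy as well shows what the backup contained before
anything is restored from it.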
> >>>>>
> >>>>>>
> >>>>>> Anything else I need to do?
> >>>>>
> >>>>> Not sure, but better find the root cause to make sure no other damage 
> >>>>> was done.
> >>>>>
> >>>>> Good luck,
> >>>>>
> >>>>>>
> >>>>>> Appreciated
> >>>>>>
> >>>>>>
> >>>>>> On 2021/02/02 11:37, Strahil Nikolov wrote:
> >>>>>>> Usually,
> >>>>>>>
> >>>>>>> I would start with checking the output of the 
> >>>>>>> /var/log/ovirt-hosted-engine-ha/{broker,agent}.log
> >>>>>>>
> >>>>>>> I'm typing it on my phone, so the path could have a typo.
> >>>>>>>
> >>>>>>> Check if the following services (also typed from memory, might have to 
> >>>>>>> remove the 'd') are running:
> >>>>>>> - sanlock
> >>>>>>> - supervdsmd
> >>>>>>> - vdsmd
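
A quick way to check the services listed above in one go, as a sketch that
assumes a systemd-based host. check_services is a hypothetical helper name;
the service names are the ones mentioned in this thread.

```shell
#!/bin/sh
# Print the state of each named service; return non-zero if any is not active.
# Usage: check_services sanlock supervdsmd vdsmd
check_services() {
    rc=0
    for svc in "$@"; do
        if systemctl is-active --quiet "$svc"; then
            echo "$svc: active"
        else
            echo "$svc: NOT active"
            rc=1
        fi
    done
    return "$rc"
}
```

For example:
check_services sanlock supervdsmd vdsmd ovirt-ha-broker ovirt-ha-agent
Anything reported as NOT active is worth looking up in the logs before
trying a re-deploy.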
> >>>>>>>
> >>>>>>>
> >>>>>>> Sometimes some of my VGs (gluster) are not activated, so if you run 
> >>>>>>> hyperconverged, you can activate them with 'vgchange -ay'.
> >>>>>>>
> >>>>>>> Best Regards,
> >>>>>>> Strahil Nikolov
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>        On Tue, Feb 2, 2021 at 11:28, Roderick Mooi
> >>>>>>>        <[email protected]> wrote:
> >>>>>>>        Hi!
> >>>>>>>
> >>>>>>>        We had a power outage and all our servers (oVirt hosts) went down. When they started up neither the hosted-engine nor VMs were started.
> >>>>>>>
> >>>>>>>        hosted-engine --vm-status
> >>>>>>>        says:
> >>>>>>>        You must run deploy first
> >>>>>>>
> >>>>>>>        I tried running deploy with various options but ultimately get stuck at:
> >>>>>>>
> >>>>>>>        The Host ID is already known. Is this a re-deployment on an additional host that was previously set up (Yes, No)[Yes]?
> >>>>>>>        ...
> >>>>>>>        [ ERROR ] Failed to execute stage 'Closing up': <urlopen error [Errno 113] No route to host>
> >>>>>>>
> >>>>>>>        OR
> >>>>>>>
> >>>>>>>        The specified storage location already contains a data domain. Is this an additional host setup (Yes, No)[Yes]? No
> >>>>>>>        [ ERROR ] Re-deploying the engine VM over a previously (partially) deployed system is not supported. Please clean up the storage device or select a different one and retry.
> >>>>>>>
> >>>>>>>        NOTES:
> >>>>>>>        1. This is oVirt v3.6 (legacy install, I know...)
> >>>>>>>        2. We do have daily engine backups (.bak files) [till the day the power failed]
> >>>>>>>
> >>>>>>>        Any advice/assistance appreciated.
> >>>>>>>
> >>>>>>>        Thanks!
> >>>>>>>
> >>>>>>>        Roderick
> >>>>>>>        _______________________________________________
> >>>>>>>        Users mailing list -- [email protected] <mailto:[email protected]>
> >>>>>>>        To unsubscribe send an email to [email protected] 
> >>>>>>> <mailto:[email protected]>
> >>>>>>>        Privacy Statement: https://www.ovirt.org/privacy-policy.html 
> >>>>>>> <https://www.ovirt.org/privacy-policy.html>
> >>>>>>>        oVirt Code of Conduct: 
> >>>>>>> https://www.ovirt.org/community/about/community-guidelines/ 
> >>>>>>> <https://www.ovirt.org/community/about/community-guidelines/>
> >>>>>>>        List Archives:
> >>>>>>>        
> >>>>>>> https://lists.ovirt.org/archives/list/[email protected]/message/73VDY7KLYBKCUXOUU4YTS4ZFGXN2ZX2U/
> >>>>>>>  
> >>>>>>> <https://lists.ovirt.org/archives/list/[email protected]/message/73VDY7KLYBKCUXOUU4YTS4ZFGXN2ZX2U/>
> >>>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >
> >
>


-- 
Didi
_______________________________________________
Users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/[email protected]/message/BIZFTSQGJHVVMXGA2TDWHLCBQ4I4VE34/