[ovirt-users] Re: Fix corrupt self-hosted engine
On Tue, Nov 24, 2020 at 12:38 PM Alex K wrote: > > > On Mon, Nov 23, 2020 at 10:09 AM Yedidyah Bar David > wrote: > >> On Mon, Nov 23, 2020 at 9:54 AM Alex K wrote: >> > >> > >> > >> > On Sun, Nov 22, 2020 at 8:57 AM Yedidyah Bar David >> wrote: >> >> >> >> On Thu, Nov 19, 2020 at 9:43 PM Alex K >> wrote: >> >>> >> >>> >> >>> >> >>> On Thu, Nov 19, 2020 at 5:31 PM Alex K >> wrote: >> >> Hi Didi, >> >> On Thu, Nov 19, 2020 at 5:13 PM Yedidyah Bar David >> wrote: >> > >> > On Thu, Nov 19, 2020 at 4:37 PM Alex K >> wrote: >> >> >> >> Hi all, >> >> >> >> I have a corrupt self-hosted engine (with several file system >> errors, postgres not able to start) and thus it does not give access to the >> web UI. This happened following an unlucky split brain resolution (I am >> running 2 nodes). The two hosts are running VMs also which I would like to >> keep running as they are needed. >> >> >> >> When trying to boot into rescue mode (using >> systemd.unit=emergency.target boot parameter) I get a cursor and nothing >> else. >> > >> > >> > This means that more than just the DB is corrupt... >> > >> >> >> >> >> >> I have backups of engine files with scope all (using the >> engine-backup tool). >> >> What is the best approach to try and fix the engine or redeploy. >> > >> > >> > If you are careful, and know what you are doing, you can try >> something like the following. I am not giving many details, hopefully you >> can find on the net tutorials about how to use the things I suggest: >> > >> > 1. Move to global maintenance >> > >> > 2. Stop the current dead vm (if needed) >> > >> > 3. Find current vm conf, edit it to boot from a rescue iso image of >> your preference or from net/PXE etc., and start the vm with '--vm-conf' >> pointing to your edited file. >> > >> > 4. Connect a console (hosted-engine --console, or 'virsh console', >> or use '--add-console-password' and remote viewer, if needed) >> > >> > 5. Clean the disk and install the OS, oVirt, etc. >> > >> > 6. 
Copy your backup into the vm and restore with engine-backup >> > >> > 7. Then cleanly stop the machine, exit global maint, and let HA >> start it (or start it yourself with --vm-start). >> > >> > At the time, we had a bug [1] to document this. The result is [2]. >> It does not detail how to boot/reinstall os/etc., only restore (if e.g. db >> is dead but fs is ok). >> > For something somewhat similar to what you want, see also [3], >> which uses guestfish. Might be useful, depending on how badly your disk is >> corrupted. >> >> I went with the guestfish approach. It has fixed some fs issues and >> now the yum etc seem fine apart from postgres. >> I had tried previously to uninstall/install packages so I ended >> installing them again with yum install ovirt\*setup\*. >> Now I think I have to run engine-setup but I get the error: >> >> Failed to execute stage 'Environment setup': Cannot connect to >> Engine database using existing credentials: engine@localhost:5432 >> >>> >> >>> Seems that I need to have psql running to be able to run >> engine-backup --mode=restore. Are there any steps how one could manually >> prepare pgsql for ovirt so as to attempt restoration? >> >> >> >> >> >> Replying again, also to conclude this part of your episode: Generally >> speaking, that's not needed. restore --provision-all-databases should do >> that for you. >> > >> > Seems that when pgsql is down nothing can be done. You need at least >> pgsql up and running (e clean state will do) so as to be able to proceed >> with restoration. >> >> Do you still have logs from this? Both engine-backup's (default to >> /var/log/ovirt-engine-backup/something if you do not pass --log) and >> ovirt-engine-provisiondb which it runs (at >> /var/log/ovirt-engine/setup). >> > I was using --provision-all-databases flag when trying to restore. I might > retest to double check. 
> When the pgsql was down, I was getting:
>
> 2020-11-19 22:06:35 4947: Start of engine-backup mode restore scope all file /var/backup/daily.0/engine-backup.gz
> 2020-11-19 22:06:35 4947: OUTPUT: Start of engine-backup with mode 'restore'
> 2020-11-19 22:06:35 4947: OUTPUT: scope: all
> 2020-11-19 22:06:35 4947: OUTPUT: archive file: /var/backup/daily.0/engine-backup.gz
> 2020-11-19 22:06:35 4947: OUTPUT: log file: restore.log
> 2020-11-19 22:06:35 4947: Setting scl env for rh-postgresql10
> 2020-11-19 22:06:35 4947: OUTPUT: Preparing to restore:
> 2020-11-19 22:06:35 4947: OUTPUT: - Unpacking file '/var/backup/daily.0/engine-backup.gz'
> 2020-11-19 22:06:35 4947: Opening tarball /var/backup/daily.0/engine-backup.gz to /tmp/engine-backup.63eeNqt4NH
> 2020-11-19 22:06:35 4947: Verifying hash
> 2020-11-19 22:06:35 4947: Verifying version
> 2020-11-19 22:06:35 4947: Reading config
> 2020-11-19 22:06:35 4947: OUTPUT:
[ovirt-users] Re: Fix corrupt self-hosted engine
On Mon, Nov 23, 2020 at 10:09 AM Yedidyah Bar David wrote: > On Mon, Nov 23, 2020 at 9:54 AM Alex K wrote: > > > > > > > > On Sun, Nov 22, 2020 at 8:57 AM Yedidyah Bar David > wrote: > >> > >> On Thu, Nov 19, 2020 at 9:43 PM Alex K wrote: > >>> > >>> > >>> > >>> On Thu, Nov 19, 2020 at 5:31 PM Alex K > wrote: > > Hi Didi, > > On Thu, Nov 19, 2020 at 5:13 PM Yedidyah Bar David > wrote: > > > > On Thu, Nov 19, 2020 at 4:37 PM Alex K > wrote: > >> > >> Hi all, > >> > >> I have a corrupt self-hosted engine (with several file system > errors, postgres not able to start) and thus it does not give access to the > web UI. This happened following an unlucky split brain resolution (I am > running 2 nodes). The two hosts are running VMs also which I would like to > keep running as they are needed. > >> > >> When trying to boot into rescue mode (using > systemd.unit=emergency.target boot parameter) I get a cursor and nothing > else. > > > > > > This means that more than just the DB is corrupt... > > > >> > >> > >> I have backups of engine files with scope all (using the > engine-backup tool). > >> What is the best approach to try and fix the engine or redeploy. > > > > > > If you are careful, and know what you are doing, you can try > something like the following. I am not giving many details, hopefully you > can find on the net tutorials about how to use the things I suggest: > > > > 1. Move to global maintenance > > > > 2. Stop the current dead vm (if needed) > > > > 3. Find current vm conf, edit it to boot from a rescue iso image of > your preference or from net/PXE etc., and start the vm with '--vm-conf' > pointing to your edited file. > > > > 4. Connect a console (hosted-engine --console, or 'virsh console', > or use '--add-console-password' and remote viewer, if needed) > > > > 5. Clean the disk and install the OS, oVirt, etc. > > > > 6. Copy your backup into the vm and restore with engine-backup > > > > 7. 
Then cleanly stop the machine, exit global maint, and let HA > start it (or start it yourself with --vm-start). > > > > At the time, we had a bug [1] to document this. The result is [2]. > It does not detail how to boot/reinstall os/etc., only restore (if e.g. db > is dead but fs is ok). > > For something somewhat similar to what you want, see also [3], which > uses guestfish. Might be useful, depending on how badly your disk is > corrupted. > > I went with the guestfish approach. It has fixed some fs issues and > now the yum etc seem fine apart from postgres. > I had tried previously to uninstall/install packages so I ended > installing them again with yum install ovirt\*setup\*. > Now I think I have to run engine-setup but I get the error: > > Failed to execute stage 'Environment setup': Cannot connect to > Engine database using existing credentials: engine@localhost:5432 > >>> > >>> Seems that I need to have psql running to be able to run engine-backup > --mode=restore. Are there any steps how one could manually prepare pgsql > for ovirt so as to attempt restoration? > >> > >> > >> Replying again, also to conclude this part of your episode: Generally > speaking, that's not needed. restore --provision-all-databases should do > that for you. > > > > Seems that when pgsql is down nothing can be done. You need at least > pgsql up and running (e clean state will do) so as to be able to proceed > with restoration. > > Do you still have logs from this? Both engine-backup's (default to > /var/log/ovirt-engine-backup/something if you do not pass --log) and > ovirt-engine-provisiondb which it runs (at > /var/log/ovirt-engine/setup). > I was using --provision-all-databases flag when trying to restore. I might retest to double check. 
When the pgsql was down, I was getting:

2020-11-19 22:06:35 4947: Start of engine-backup mode restore scope all file /var/backup/daily.0/engine-backup.gz
2020-11-19 22:06:35 4947: OUTPUT: Start of engine-backup with mode 'restore'
2020-11-19 22:06:35 4947: OUTPUT: scope: all
2020-11-19 22:06:35 4947: OUTPUT: archive file: /var/backup/daily.0/engine-backup.gz
2020-11-19 22:06:35 4947: OUTPUT: log file: restore.log
2020-11-19 22:06:35 4947: Setting scl env for rh-postgresql10
2020-11-19 22:06:35 4947: OUTPUT: Preparing to restore:
2020-11-19 22:06:35 4947: OUTPUT: - Unpacking file '/var/backup/daily.0/engine-backup.gz'
2020-11-19 22:06:35 4947: Opening tarball /var/backup/daily.0/engine-backup.gz to /tmp/engine-backup.63eeNqt4NH
2020-11-19 22:06:35 4947: Verifying hash
2020-11-19 22:06:35 4947: Verifying version
2020-11-19 22:06:35 4947: Reading config
2020-11-19 22:06:35 4947: OUTPUT: Restoring:
2020-11-19 22:06:35 4947: OUTPUT: - Files
2020-11-19 22:06:35 4947: Restoring files
2020-11-19 22:06:36 4947: Reloading configuration
2020-11-19 22:06:36 4947: Generating pgpass
2020-11-19
[ovirt-users] Re: Fix corrupt self-hosted engine
On Mon, Nov 23, 2020 at 9:54 AM Alex K wrote: > > > > On Sun, Nov 22, 2020 at 8:57 AM Yedidyah Bar David wrote: >> >> On Thu, Nov 19, 2020 at 9:43 PM Alex K wrote: >>> >>> >>> >>> On Thu, Nov 19, 2020 at 5:31 PM Alex K wrote: Hi Didi, On Thu, Nov 19, 2020 at 5:13 PM Yedidyah Bar David wrote: > > On Thu, Nov 19, 2020 at 4:37 PM Alex K wrote: >> >> Hi all, >> >> I have a corrupt self-hosted engine (with several file system errors, >> postgres not able to start) and thus it does not give access to the web >> UI. This happened following an unlucky split brain resolution (I am >> running 2 nodes). The two hosts are running VMs also which I would like >> to keep running as they are needed. >> >> When trying to boot into rescue mode (using >> systemd.unit=emergency.target boot parameter) I get a cursor and nothing >> else. > > > This means that more than just the DB is corrupt... > >> >> >> I have backups of engine files with scope all (using the engine-backup >> tool). >> What is the best approach to try and fix the engine or redeploy. > > > If you are careful, and know what you are doing, you can try something > like the following. I am not giving many details, hopefully you can find > on the net tutorials about how to use the things I suggest: > > 1. Move to global maintenance > > 2. Stop the current dead vm (if needed) > > 3. Find current vm conf, edit it to boot from a rescue iso image of your > preference or from net/PXE etc., and start the vm with '--vm-conf' > pointing to your edited file. > > 4. Connect a console (hosted-engine --console, or 'virsh console', or use > '--add-console-password' and remote viewer, if needed) > > 5. Clean the disk and install the OS, oVirt, etc. > > 6. Copy your backup into the vm and restore with engine-backup > > 7. Then cleanly stop the machine, exit global maint, and let HA start it > (or start it yourself with --vm-start). > > At the time, we had a bug [1] to document this. The result is [2]. 
It > does not detail how to boot/reinstall os/etc., only restore (if e.g. db > is dead but fs is ok). > For something somewhat similar to what you want, see also [3], which uses > guestfish. Might be useful, depending on how badly your disk is corrupted. I went with the guestfish approach. It has fixed some fs issues and now the yum etc seem fine apart from postgres. I had tried previously to uninstall/install packages so I ended installing them again with yum install ovirt\*setup\*. Now I think I have to run engine-setup but I get the error: Failed to execute stage 'Environment setup': Cannot connect to Engine database using existing credentials: engine@localhost:5432 >>> >>> Seems that I need to have psql running to be able to run engine-backup >>> --mode=restore. Are there any steps how one could manually prepare pgsql >>> for ovirt so as to attempt restoration? >> >> >> Replying again, also to conclude this part of your episode: Generally >> speaking, that's not needed. restore --provision-all-databases should do >> that for you. > > Seems that when pgsql is down nothing can be done. You need at least pgsql up > and running (e clean state will do) so as to be able to proceed with > restoration. Do you still have logs from this? Both engine-backup's (default to /var/log/ovirt-engine-backup/something if you do not pass --log) and ovirt-engine-provisiondb which it runs (at /var/log/ovirt-engine/setup). Not sure what you mean in "a clean state will do". If you just install PG, it is not enabled by default, so is not "up and running". Generally speaking: If you never started/inited PG (e.g. on a clean machine), restore, with --provision-all-databases, does this for you. Are you sure you passed this? If you did, and created DB/user with the same name it wants to restore to, but left the DB empty, it will use it. If you populated the DB, it will fail with a suitable error message. These are the states that are intended to be supported. 
Anything else might break it in other ways. >> >> >> I replied to all your interim emails in private, since you replied in >> private. > > Did not notice I was replying in private :) NP :-) >> >> >> Thanks for the final message to the list. >> >> It would be nice if you send another summary of the main obstacles you ran >> into, what worked and didn't work, and especially what ideas you can think >> of to improve the code/doc for the next time something similar happens (also >> to you :-) ). >> >> If you feel like that, and have time, it sounds like a nice opportunity for >> a blog post :-) (I know I (almost?) never wrote any myself, sorry, but I >> like reading them - and they are much more approachable and useful, over the >> long
[ovirt-users] Re: Fix corrupt self-hosted engine
On Sun, Nov 22, 2020 at 8:57 AM Yedidyah Bar David wrote: > On Thu, Nov 19, 2020 at 9:43 PM Alex K wrote: > >> >> >> On Thu, Nov 19, 2020 at 5:31 PM Alex K wrote: >> >>> Hi Didi, >>> >>> On Thu, Nov 19, 2020 at 5:13 PM Yedidyah Bar David >>> wrote: >>> On Thu, Nov 19, 2020 at 4:37 PM Alex K wrote: > Hi all, > > I have a corrupt self-hosted engine (with several file system errors, > postgres not able to start) and thus it does not give access to the web > UI. > This happened following an unlucky split brain resolution (I am running 2 > nodes). The two hosts are running VMs also which I would like to keep > running as they are needed. > > When trying to boot into rescue mode (using > systemd.unit=emergency.target boot parameter) I get a cursor and nothing > else. > This means that more than just the DB is corrupt... > > I have backups of engine files with scope all (using the engine-backup > tool). > What is the best approach to try and fix the engine or redeploy. > If you are careful, and know what you are doing, you can try something like the following. I am not giving many details, hopefully you can find on the net tutorials about how to use the things I suggest: 1. Move to global maintenance 2. Stop the current dead vm (if needed) 3. Find current vm conf, edit it to boot from a rescue iso image of your preference or from net/PXE etc., and start the vm with '--vm-conf' pointing to your edited file. 4. Connect a console (hosted-engine --console, or 'virsh console', or use '--add-console-password' and remote viewer, if needed) 5. Clean the disk and install the OS, oVirt, etc. 6. Copy your backup into the vm and restore with engine-backup 7. Then cleanly stop the machine, exit global maint, and let HA start it (or start it yourself with --vm-start). At the time, we had a bug [1] to document this. The result is [2]. It does not detail how to boot/reinstall os/etc., only restore (if e.g. db is dead but fs is ok). 
For something somewhat similar to what you want, see also [3], which uses guestfish. Might be useful, depending on how badly your disk is corrupted. >>> I went with the guestfish approach. It has fixed some fs issues and now >>> the yum etc seem fine apart from postgres. >>> I had tried previously to uninstall/install packages so I ended >>> installing them again with yum install ovirt\*setup\*. >>> Now I think I have to run engine-setup but I get the error: >>> >>> Failed to execute stage 'Environment setup': Cannot connect to Engine >>> database using existing credentials: engine@localhost:5432 >>> >> Seems that I need to have psql running to be able to run engine-backup >> --mode=restore. Are there any steps how one could manually prepare pgsql >> for ovirt so as to attempt restoration? >> > > Replying again, also to conclude this part of your episode: Generally > speaking, that's not needed. restore --provision-all-databases should do > that for you. > Seems that when pgsql is down nothing can be done. You need at least pgsql up and running (e clean state will do) so as to be able to proceed with restoration. > > I replied to all your interim emails in private, since you replied in > private. > Did not notice I was replying in private :) > > Thanks for the final message to the list. > > It would be nice if you send another summary of the main obstacles you ran > into, what worked and didn't work, and especially what ideas you can think > of to improve the code/doc for the next time something similar happens > (also to you :-) ). > > If you feel like that, and have time, it sounds like a nice opportunity > for a blog post :-) (I know I (almost?) never wrote any myself, sorry, but > I like reading them - and they are much more approachable and useful, over > the long run, compared to just posting to the list). > Noted. Will check to put this in a blog. 
Generally the missing part from the docs was that one cannot proceed with the restoration if pgsql is not able to start. So I had to clean re-install pgsql and initialize its data store before proceeding with the restoration. > > Best regards, > > >> >>> So I guess I need to follow [2]. What do you think? >>> >>> How did you run into a split brain? There is a lock on the shared storage that should prevent this. Good luck and best regards, [1] https://bugzilla.redhat.com/show_bug.cgi?id=1482710 [2] https://www.ovirt.org/documentation/administration_guide/#Overwriting_a_Self-Hosted_Engine [3] https://bugzilla.redhat.com/show_bug.cgi?id=1569827#c4 -- Didi >>> > > -- > Didi > ___ Users mailing list -- users@ovirt.org To unsubscribe send an email to users-le...@ovirt.org Privacy Statement:
[ovirt-users] Re: Fix corrupt self-hosted engine
On Thu, Nov 19, 2020 at 11:33 PM Alex K wrote:
> For the records,
>
> After having fixed the major fs issues with guestfish and since the DB was not starting up, I removed everything from DB data dir and recreated it as below:
>
> rm -rf /var/opt/rh/rh-postgresql10/lib/pgsql/data/*
> /opt/rh/rh-postgresql10/root/usr/bin/postgresql-setup --initdb
> systemctl restart rh-postgresql10-postgresql.service

Generally speaking, this should not be needed. --provision-all-databases should do this for you.

> Then proceeded with the restoration, where I requested to provision all missing databases:
> engine-backup --mode=restore --file=engine-backup.gz --provision-all-databases \
>     --log=restore.log --restore-permissions
>
> Following this, ran engine-setup, as instructed from the restore operation.
> Gained engine web access and saw the same running VMs were shown as up without issues.
> I only observed one VM not able to start due to illegal volume, but that's another story.

Glad to hear that, thanks for the report!

Best regards,

> On Thu, Nov 19, 2020 at 9:42 PM Alex K wrote:
>> On Thu, Nov 19, 2020 at 5:31 PM Alex K wrote:
>>> Hi Didi,
>>>
>>> On Thu, Nov 19, 2020 at 5:13 PM Yedidyah Bar David wrote:
>>>> On Thu, Nov 19, 2020 at 4:37 PM Alex K wrote:
>>>>> Hi all,
>>>>>
>>>>> I have a corrupt self-hosted engine (with several file system errors, postgres not able to start) and thus it does not give access to the web UI.
>>>>> This happened following an unlucky split brain resolution (I am running 2 nodes). The two hosts are running VMs also which I would like to keep running as they are needed.
>>>>>
>>>>> When trying to boot into rescue mode (using systemd.unit=emergency.target boot parameter) I get a cursor and nothing else.
>>>>
>>>> This means that more than just the DB is corrupt...
>>>>
>>>>> I have backups of engine files with scope all (using the engine-backup tool).
>>>>> What is the best approach to try and fix the engine or redeploy.
> If you are careful, and know what you are doing, you can try something like the following. I am not giving many details, hopefully you can find on the net tutorials about how to use the things I suggest: 1. Move to global maintenance 2. Stop the current dead vm (if needed) 3. Find current vm conf, edit it to boot from a rescue iso image of your preference or from net/PXE etc., and start the vm with '--vm-conf' pointing to your edited file. 4. Connect a console (hosted-engine --console, or 'virsh console', or use '--add-console-password' and remote viewer, if needed) 5. Clean the disk and install the OS, oVirt, etc. 6. Copy your backup into the vm and restore with engine-backup 7. Then cleanly stop the machine, exit global maint, and let HA start it (or start it yourself with --vm-start). At the time, we had a bug [1] to document this. The result is [2]. It does not detail how to boot/reinstall os/etc., only restore (if e.g. db is dead but fs is ok). For something somewhat similar to what you want, see also [3], which uses guestfish. Might be useful, depending on how badly your disk is corrupted. >>> I went with the guestfish approach. It has fixed some fs issues and now >>> the yum etc seem fine apart from postgres. >>> I had tried previously to uninstall/install packages so I ended >>> installing them again with yum install ovirt\*setup\*. >>> Now I think I have to run engine-setup but I get the error: >>> >>> Failed to execute stage 'Environment setup': Cannot connect to Engine >>> database using existing credentials: engine@localhost:5432 >>> >> Seems that I need to have psql running to be able to run engine-backup >> --mode=restore. Are there any steps how one could manually prepare pgsql >> for ovirt so as to attempt restoration? >> >>> >>> So I guess I need to follow [2]. What do you think? >>> >>> How did you run into a split brain? There is a lock on the shared storage that should prevent this. 
Good luck and best regards, [1] https://bugzilla.redhat.com/show_bug.cgi?id=1482710 [2] https://www.ovirt.org/documentation/administration_guide/#Overwriting_a_Self-Hosted_Engine [3] https://bugzilla.redhat.com/show_bug.cgi?id=1569827#c4 -- Didi >>> ___ > Users mailing list -- users@ovirt.org > To unsubscribe send an email to users-le...@ovirt.org > Privacy Statement: https://www.ovirt.org/privacy-policy.html > oVirt Code of Conduct: > https://www.ovirt.org/community/about/community-guidelines/ > List Archives: > https://lists.ovirt.org/archives/list/users@ovirt.org/message/SU6V565Y5GAZ67FF5MUDGFLEJ2L2LZV7/ > -- Didi ___ Users mailing list --
[ovirt-users] Re: Fix corrupt self-hosted engine
For the record,

After fixing the major fs issues with guestfish, and since the DB was still not starting up, I removed everything from the DB data dir and recreated it as below:

rm -rf /var/opt/rh/rh-postgresql10/lib/pgsql/data/*
/opt/rh/rh-postgresql10/root/usr/bin/postgresql-setup --initdb
systemctl restart rh-postgresql10-postgresql.service

Then I proceeded with the restoration, where I requested to provision all missing databases:

engine-backup --mode=restore --file=engine-backup.gz --provision-all-databases \
    --log=restore.log --restore-permissions

Following this, I ran engine-setup, as instructed by the restore operation.
I gained engine web access and saw that the same running VMs were shown as up without issues.
I only observed one VM that was not able to start due to an illegal volume, but that's another story.

On Thu, Nov 19, 2020 at 9:42 PM Alex K wrote:
> On Thu, Nov 19, 2020 at 5:31 PM Alex K wrote:
>> Hi Didi,
>>
>> On Thu, Nov 19, 2020 at 5:13 PM Yedidyah Bar David wrote:
>>> On Thu, Nov 19, 2020 at 4:37 PM Alex K wrote:
>>>> Hi all,
>>>>
>>>> I have a corrupt self-hosted engine (with several file system errors, postgres not able to start) and thus it does not give access to the web UI.
>>>> This happened following an unlucky split brain resolution (I am running 2 nodes). The two hosts are running VMs also which I would like to keep running as they are needed.
>>>>
>>>> When trying to boot into rescue mode (using systemd.unit=emergency.target boot parameter) I get a cursor and nothing else.
>>>
>>> This means that more than just the DB is corrupt...
>>>
>>>> I have backups of engine files with scope all (using the engine-backup tool).
>>>> What is the best approach to try and fix the engine or redeploy.
>>>
>>> If you are careful, and know what you are doing, you can try something like the following. I am not giving many details, hopefully you can find on the net tutorials about how to use the things I suggest:
>>>
>>> 1. Move to global maintenance
>>>
>>> 2. Stop the current dead vm (if needed)
>>>
>>> 3. Find current vm conf, edit it to boot from a rescue iso image of your preference or from net/PXE etc., and start the vm with '--vm-conf' pointing to your edited file.
>>>
>>> 4. Connect a console (hosted-engine --console, or 'virsh console', or use '--add-console-password' and remote viewer, if needed)
>>>
>>> 5. Clean the disk and install the OS, oVirt, etc.
>>>
>>> 6. Copy your backup into the vm and restore with engine-backup
>>>
>>> 7. Then cleanly stop the machine, exit global maint, and let HA start it (or start it yourself with --vm-start).
>>>
>>> At the time, we had a bug [1] to document this. The result is [2]. It does not detail how to boot/reinstall os/etc., only restore (if e.g. db is dead but fs is ok).
>>> For something somewhat similar to what you want, see also [3], which uses guestfish. Might be useful, depending on how badly your disk is corrupted.
>>
>> I went with the guestfish approach. It has fixed some fs issues and now the yum etc seem fine apart from postgres.
>> I had tried previously to uninstall/install packages so I ended installing them again with yum install ovirt\*setup\*.
>> Now I think I have to run engine-setup but I get the error:
>>
>> Failed to execute stage 'Environment setup': Cannot connect to Engine database using existing credentials: engine@localhost:5432
>
> Seems that I need to have psql running to be able to run engine-backup --mode=restore. Are there any steps how one could manually prepare pgsql for ovirt so as to attempt restoration?
>
>> So I guess I need to follow [2]. What do you think?
>>
>>> How did you run into a split brain? There is a lock on the shared storage that should prevent this.
>>>
>>> Good luck and best regards,
>>>
>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1482710
>>> [2] https://www.ovirt.org/documentation/administration_guide/#Overwriting_a_Self-Hosted_Engine
>>> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1569827#c4
>>> --
>>> Didi
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/SU6V565Y5GAZ67FF5MUDGFLEJ2L2LZV7/
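The recovery sequence reported in this message could be wrapped in a small script along the following lines. This is only a sketch under stated assumptions, not a tested procedure: the paths and service name are the ones quoted in the message, the backup file name is taken from the same command line, and the `run` dry-run helper, the extra `systemctl stop`, and the use of `find ... -delete` in place of the `rm -rf .../data/*` glob are illustrative additions that do not come from the thread. Note that elsewhere in the thread Didi says the manual re-init should normally not be needed, since --provision-all-databases is meant to cover it.

```shell
#!/bin/sh
# Sketch only. Dry run by default: commands are printed, not executed.
# Set RUN=1 to really execute them (this destroys the PostgreSQL data dir).
set -e

# Paths and service name as quoted in the message above.
PGDATA="${PGDATA:-/var/opt/rh/rh-postgresql10/lib/pgsql/data}"
PGSETUP="${PGSETUP:-/opt/rh/rh-postgresql10/root/usr/bin/postgresql-setup}"
PGSERVICE="${PGSERVICE:-rh-postgresql10-postgresql.service}"
BACKUP="${BACKUP:-engine-backup.gz}"

# Hypothetical helper (not part of oVirt): print or execute a command.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '+ %s\n' "$*"
    fi
}

reinit_postgres() {
    # Assumption: stop the service before wiping (not in the original message).
    run systemctl stop "$PGSERVICE"
    # Stands in for: rm -rf "$PGDATA"/*  (also removes dot-files)
    run find "$PGDATA" -mindepth 1 -delete
    run "$PGSETUP" --initdb
    run systemctl restart "$PGSERVICE"
}

restore_engine() {
    run engine-backup --mode=restore --file="$BACKUP" \
        --provision-all-databases --log=restore.log --restore-permissions
    # The restore output instructs running engine-setup afterwards.
    run engine-setup
}

reinit_postgres
restore_engine
```

Running it without RUN=1 only prints the six commands, which makes it easy to review the plan before touching the engine VM.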
[ovirt-users] Re: Fix corrupt self-hosted engine
On Thu, Nov 19, 2020 at 5:12 PM Yedidyah Bar David wrote:
> On Thu, Nov 19, 2020 at 4:37 PM Alex K wrote:
>> Hi all,
>>
>> I have a corrupt self-hosted engine (with several file system errors, postgres not able to start) and thus it does not give access to the web UI.
>> This happened following an unlucky split brain resolution (I am running 2 nodes). The two hosts are running VMs also which I would like to keep running as they are needed.
>>
>> When trying to boot into rescue mode (using systemd.unit=emergency.target boot parameter) I get a cursor and nothing else.
>
> This means that more than just the DB is corrupt...
>
>> I have backups of engine files with scope all (using the engine-backup tool).
>> What is the best approach to try and fix the engine or redeploy.
>
> If you are careful, and know what you are doing, you can try something like the following. I am not giving many details, hopefully you can find on the net tutorials about how to use the things I suggest:
>
> 1. Move to global maintenance
>
> 2. Stop the current dead vm (if needed)
>
> 3. Find current vm conf, edit it to boot from a rescue iso image of your preference or from net/PXE etc., and start the vm with '--vm-conf' pointing to your edited file.
>
> 4. Connect a console (hosted-engine --console, or 'virsh console', or use '--add-console-password' and remote viewer, if needed)
>
> 5. Clean the disk and install the OS, oVirt, etc.
>
> 6. Copy your backup into the vm and restore with engine-backup
>
> 7. Then cleanly stop the machine, exit global maint, and let HA start it (or start it yourself with --vm-start).
>
> At the time, we had a bug [1] to document this. The result is [2]. It does not detail how to boot/reinstall os/etc., only restore (if e.g. db is dead but fs is ok).
> For something somewhat similar to what you want, see also [3], which uses guestfish. Might be useful, depending on how badly your disk is corrupted.
>
> How did you run into a split brain? There is a lock on the shared storage that should prevent this.

Also, to clarify: The "official" answer is to deploy a new hosted-engine, on new storage, with --restore-from-file. This IMO does not let you keep your VMs up, at least not all of them, definitely if you don't have another host to restore on. Keeping the VMs up is risky if you have HA VMs, or if you started/stopped/migrated VMs after you took your backup.

Best regards,
--
Didi
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/ZCCYFBSZVUY7YJ7L5Q5U3SG4CI3CPHNN/
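For comparison, the "official" redeploy route mentioned in this message might look roughly like the sketch below. It assumes the backup archive has already been copied to the host that will run the new deployment; the path /root/engine-backup.gz and the `run` dry-run helper are hypothetical, and the `tar -tf` readability check is an extra precaution that is not part of the thread (it relies on the archive being a tarball, as the restore log elsewhere in the thread suggests).

```shell
#!/bin/sh
# Sketch only. Dry run by default; set RUN=1 to execute for real.
set -e

# Hypothetical location of the engine-backup archive on the new host.
BACKUP="${BACKUP:-/root/engine-backup.gz}"

# Hypothetical helper: print or execute a command.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '+ %s\n' "$*"
    fi
}

redeploy_from_backup() {
    # Assumption: list the archive first to confirm it is readable.
    run tar -tf "$BACKUP"
    # Deploy a new hosted engine (on new storage) from the backup.
    run hosted-engine --deploy --restore-from-file="$BACKUP"
}

redeploy_from_backup
```

As the message points out, this route does not let you keep all VMs running, so it is a trade-off against the in-place repair discussed in the rest of the thread.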
[ovirt-users] Re: Fix corrupt self-hosted engine
On Thu, Nov 19, 2020 at 4:37 PM Alex K wrote:
> Hi all,
>
> I have a corrupt self-hosted engine (with several file system errors, postgres not able to start) and thus it does not give access to the web UI.
> This happened following an unlucky split brain resolution (I am running 2 nodes). The two hosts are running VMs also which I would like to keep running as they are needed.
>
> When trying to boot into rescue mode (using systemd.unit=emergency.target boot parameter) I get a cursor and nothing else.

This means that more than just the DB is corrupt...

> I have backups of engine files with scope all (using the engine-backup tool).
> What is the best approach to try and fix the engine or redeploy.

If you are careful, and know what you are doing, you can try something like the following. I am not giving many details, hopefully you can find on the net tutorials about how to use the things I suggest:

1. Move to global maintenance

2. Stop the current dead vm (if needed)

3. Find current vm conf, edit it to boot from a rescue iso image of your preference or from net/PXE etc., and start the vm with '--vm-conf' pointing to your edited file.

4. Connect a console (hosted-engine --console, or 'virsh console', or use '--add-console-password' and remote viewer, if needed)

5. Clean the disk and install the OS, oVirt, etc.

6. Copy your backup into the vm and restore with engine-backup

7. Then cleanly stop the machine, exit global maint, and let HA start it (or start it yourself with --vm-start).

At the time, we had a bug [1] to document this. The result is [2]. It does not detail how to boot/reinstall os/etc., only restore (if e.g. db is dead but fs is ok).
For something somewhat similar to what you want, see also [3], which uses guestfish. Might be useful, depending on how badly your disk is corrupted.

How did you run into a split brain? There is a lock on the shared storage that should prevent this.

Good luck and best regards,

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1482710
[2] https://www.ovirt.org/documentation/administration_guide/#Overwriting_a_Self-Hosted_Engine
[3] https://bugzilla.redhat.com/show_bug.cgi?id=1569827#c4
--
Didi
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/2ALJN3CXYNC2UUCEI6H7HX3QU7YWUAML/
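The host-side parts of the seven steps above (1 to 4, and 7) could be sketched with the hosted-engine command line roughly as follows. Treat it as an illustration under assumptions rather than a verified procedure: the vm.conf location /var/run/ovirt-hosted-engine-ha/vm.conf, the name of the edited copy, and the `run` dry-run helper are not from the thread, and the actual edit of the copied file (pointing it at a rescue ISO or PXE) as well as steps 5 and 6 inside the VM remain manual.

```shell
#!/bin/sh
# Sketch only. Dry run by default: commands are printed, not executed.
# Set RUN=1 and call the individual functions to execute for real.
set -e

# ASSUMPTIONS (not from the thread): where the current vm.conf lives and
# where the edited copy is kept.
VM_CONF="${VM_CONF:-/var/run/ovirt-hosted-engine-ha/vm.conf}"
EDITED_CONF="${EDITED_CONF:-/root/vm-rescue.conf}"

# Hypothetical helper: print or execute a command.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '+ %s\n' "$*"
    fi
}

prepare() {
    run hosted-engine --set-maintenance --mode=global    # step 1
    run hosted-engine --vm-poweroff                      # step 2
    run cp "$VM_CONF" "$EDITED_CONF"                     # step 3, first half
    # Now edit $EDITED_CONF by hand so the VM boots from rescue media.
}

boot_rescue() {
    run hosted-engine --vm-start --vm-conf="$EDITED_CONF"   # step 3
    run hosted-engine --add-console-password                # step 4, for remote viewer
    run hosted-engine --console                             # step 4, serial console
}

finish() {
    run hosted-engine --vm-shutdown                      # step 7
    run hosted-engine --set-maintenance --mode=none      # leave global maintenance
}

# With no argument, print the whole plan as a dry run.
case "${1:-plan}" in
    prepare|boot_rescue|finish) "$1" ;;
    plan) RUN=0; prepare; boot_rescue; finish ;;
esac
```

Splitting the sequence into three functions leaves room for the manual edit between `prepare` and `boot_rescue`, and for the reinstall and engine-backup restore inside the VM before `finish`.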