On Tue, Nov 24, 2020 at 12:38 PM Alex K <rightkickt...@gmail.com> wrote:
>
> On Mon, Nov 23, 2020 at 10:09 AM Yedidyah Bar David <d...@redhat.com> wrote:
>
>> On Mon, Nov 23, 2020 at 9:54 AM Alex K <rightkickt...@gmail.com> wrote:
>> >
>> > On Sun, Nov 22, 2020 at 8:57 AM Yedidyah Bar David <d...@redhat.com> wrote:
>> >>
>> >> On Thu, Nov 19, 2020 at 9:43 PM Alex K <rightkickt...@gmail.com> wrote:
>> >>>
>> >>> On Thu, Nov 19, 2020 at 5:31 PM Alex K <rightkickt...@gmail.com> wrote:
>> >>>>
>> >>>> Hi Didi,
>> >>>>
>> >>>> On Thu, Nov 19, 2020 at 5:13 PM Yedidyah Bar David <d...@redhat.com> wrote:
>> >>>>>
>> >>>>> On Thu, Nov 19, 2020 at 4:37 PM Alex K <rightkickt...@gmail.com> wrote:
>> >>>>>>
>> >>>>>> Hi all,
>> >>>>>>
>> >>>>>> I have a corrupt self-hosted engine (with several file system
>> >>>>>> errors, postgres not able to start) and thus it does not give
>> >>>>>> access to the web UI. This happened following an unlucky split
>> >>>>>> brain resolution (I am running 2 nodes). The two hosts are
>> >>>>>> running VMs also which I would like to keep running as they
>> >>>>>> are needed.
>> >>>>>>
>> >>>>>> When trying to boot into rescue mode (using
>> >>>>>> systemd.unit=emergency.target boot parameter) I get a cursor
>> >>>>>> and nothing else.
>> >>>>>
>> >>>>> This means that more than just the DB is corrupt...
>> >>>>>
>> >>>>>>
>> >>>>>> I have backups of engine files with scope all (using the
>> >>>>>> engine-backup tool).
>> >>>>>> What is the best approach to try and fix the engine or redeploy?
>> >>>>>
>> >>>>> If you are careful, and know what you are doing, you can try
>> >>>>> something like the following. I am not giving many details,
>> >>>>> hopefully you can find on the net tutorials about how to use the
>> >>>>> things I suggest:
>> >>>>>
>> >>>>> 1. Move to global maintenance
>> >>>>>
>> >>>>> 2. Stop the current dead vm (if needed)
>> >>>>>
>> >>>>> 3. Find current vm conf, edit it to boot from a rescue iso image
>> >>>>> of your preference or from net/PXE etc., and start the vm with
>> >>>>> '--vm-conf' pointing to your edited file.
>> >>>>>
>> >>>>> 4. Connect a console (hosted-engine --console, or 'virsh
>> >>>>> console', or use '--add-console-password' and remote viewer, if
>> >>>>> needed)
>> >>>>>
>> >>>>> 5. Clean the disk and install the OS, oVirt, etc.
>> >>>>>
>> >>>>> 6. Copy your backup into the vm and restore with engine-backup
>> >>>>>
>> >>>>> 7. Then cleanly stop the machine, exit global maint, and let HA
>> >>>>> start it (or start it yourself with --vm-start).
>> >>>>>
>> >>>>> At the time, we had a bug [1] to document this. The result is
>> >>>>> [2]. It does not detail how to boot/reinstall os/etc., only
>> >>>>> restore (if e.g. db is dead but fs is ok).
>> >>>>> For something somewhat similar to what you want, see also [3],
>> >>>>> which uses guestfish. Might be useful, depending on how badly
>> >>>>> your disk is corrupted.
>> >>>>
>> >>>> I went with the guestfish approach. It has fixed some fs issues
>> >>>> and now the yum etc seem fine apart from postgres.
>> >>>> I had tried previously to uninstall/install packages so I ended
>> >>>> installing them again with yum install ovirt\*setup\*.
>> >>>> Now I think I have to run engine-setup but I get the error:
>> >>>>
>> >>>> Failed to execute stage 'Environment setup': Cannot connect to
>> >>>> Engine database using existing credentials: engine@localhost:5432
>> >>>
>> >>> Seems that I need to have psql running to be able to run
>> >>> engine-backup --mode=restore. Are there any steps how one could
>> >>> manually prepare pgsql for ovirt so as to attempt restoration?
>> >>
>> >> Replying again, also to conclude this part of your episode:
>> >> Generally speaking, that's not needed. restore
>> >> --provision-all-databases should do that for you.
>> >
>> > Seems that when pgsql is down nothing can be done. You need at least
>> > pgsql up and running (a clean state will do) so as to be able to
>> > proceed with restoration.
>>
>> Do you still have logs from this? Both engine-backup's (default to
>> /var/log/ovirt-engine-backup/something if you do not pass --log) and
>> ovirt-engine-provisiondb which it runs (at
>> /var/log/ovirt-engine/setup).
>>
> I was using --provision-all-databases flag when trying to restore. I
> might retest to double check. When the pgsql was down, I was getting:
>
> 2020-11-19 22:06:35 4947: Start of engine-backup mode restore scope all file /var/backup/daily.0/engine-backup.gz
> 2020-11-19 22:06:35 4947: OUTPUT: Start of engine-backup with mode 'restore'
> 2020-11-19 22:06:35 4947: OUTPUT: scope: all
> 2020-11-19 22:06:35 4947: OUTPUT: archive file: /var/backup/daily.0/engine-backup.gz
> 2020-11-19 22:06:35 4947: OUTPUT: log file: restore.log
> 2020-11-19 22:06:35 4947: Setting scl env for rh-postgresql10
> 2020-11-19 22:06:35 4947: OUTPUT: Preparing to restore:
> 2020-11-19 22:06:35 4947: OUTPUT: - Unpacking file '/var/backup/daily.0/engine-backup.gz'
> 2020-11-19 22:06:35 4947: Opening tarball /var/backup/daily.0/engine-backup.gz to /tmp/engine-backup.63eeNqt4NH
> 2020-11-19 22:06:35 4947: Verifying hash
> 2020-11-19 22:06:35 4947: Verifying version
> 2020-11-19 22:06:35 4947: Reading config
> 2020-11-19 22:06:35 4947: OUTPUT: Restoring:
> 2020-11-19 22:06:35 4947: OUTPUT: - Files
> 2020-11-19 22:06:35 4947: Restoring files
> 2020-11-19 22:06:36 4947: Reloading configuration
>

After this point, if it was restoring any db, it should have had:

Provisioning PostgreSQL users/databases:

So either you didn't pass any '--provision' option, or your backup did
not include any db dump. Perhaps you run your backups with
'--scope=files'?
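
To check a restore log for that line, something like the following might do. It is only a rough sketch: restore_reached_db_stage is a made-up helper name, not part of engine-backup, and the marker text is simply the line quoted above, so it may differ between versions.

```shell
#!/bin/sh
# Sketch only: report whether an engine-backup restore log reached the
# database provisioning stage. The helper name is hypothetical; the marker
# string is the one quoted in this thread and may vary between versions.
restore_reached_db_stage() {
    log_file="$1"
    if grep -q 'Provisioning PostgreSQL users/databases' "$log_file"; then
        echo "db-stage-reached"
    else
        echo "files-only"
    fi
}

# When called with a log path as argument, print the verdict for that log.
if [ "$#" -ge 1 ]; then
    restore_reached_db_stage "$1"
fi
```

Run against the log quoted above (restore.log), this would print "files-only", since the provisioning line never shows up there.
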
( Now pushed this patch, to log what we find:
https://gerrit.ovirt.org/112338 )

> 2020-11-19 22:06:36 4947: Generating pgpass
> 2020-11-19 22:06:36 4947: Verifying connection
> 2020-11-19 22:06:36 4947: pg_cmd running: psql -w -U engine -h localhost -p 5432 engine -c select 1
> psql: FATAL: Ident authentication failed for user "engine"
> 2020-11-19 22:06:36 4947: FATAL: Can't connect to database 'engine'. Please see '/usr/bin/engine-backup --help'.
>
>> Not sure what you mean by "a clean state will do". If you just install
>> PG, it is not enabled by default, so is not "up and running".
>>
> I mean pgsql re-installed and the data store cleaned as below:
>
> rm -rf /var/opt/rh/rh-postgresql10/lib/pgsql/data/*
> /opt/rh/rh-postgresql10/root/usr/bin/postgresql-setup --initdb
> systemctl restart rh-postgresql10-postgresql.service
>
>>
>> Generally speaking:
>>
>> If you never started/inited PG (e.g. on a clean machine), restore,
>> with --provision-all-databases, does this for you. Are you sure you
>> passed this?
>>
> I am pretty sure I used that flag but might be able to repeat for
> testing.
>

Thanks.

>
>> If you did, and created DB/user with the same name it wants to restore
>> to, but left the DB empty, it will use it.
>>
>> If you populated the DB, it will fail with a suitable error message.
>>
> Confirmed. When I created the DB and users it was failing. So I cleaned
> everything, started pgsql and left the tool to do its job.
>

If you created both user and db, but left the db empty, it should have
been used as-is. Only if it has content we fail.

Best regards,

>
>> These are the states that are intended to be supported.
>>
>> Anything else might break it in other ways.
>>
>> >>
>> >> I replied to all your interim emails in private, since you replied
>> >> in private.
>> >
>> > Did not notice I was replying in private :)
>>
>> NP :-)
>>
>> >>
>> >> Thanks for the final message to the list.
>> >>
>> >> It would be nice if you send another summary of the main obstacles
>> >> you ran into, what worked and didn't work, and especially what
>> >> ideas you can think of to improve the code/doc for the next time
>> >> something similar happens (also to you :-) ).
>> >>
>> >> If you feel like that, and have time, it sounds like a nice
>> >> opportunity for a blog post :-) (I know I (almost?) never wrote any
>> >> myself, sorry, but I like reading them - and they are much more
>> >> approachable and useful, over the long run, compared to just
>> >> posting to the list).
>> >
>> > Noted. Will check to put this in a blog. Generally the missing part
>> > from the docs was that one cannot proceed with the restoration if
>> > pgsql is not able to start. So I had to clean re-install pgsql and
>> > initialize its data store before proceeding with the restoration.
>>
>> Well, I'd definitely not want a blog post saying you must manually
>> init PG - if you indeed must, that's a bug, so I'd rather fix it
>> first.
>>
> Noted.
>
>>
>> Thanks and best regards,
>>
>> >>
>> >> Best regards,
>> >>
>> >>>>
>> >>>> So I guess I need to follow [2]. What do you think?
>> >>>>
>> >>>>>
>> >>>>> How did you run into a split brain? There is a lock on the
>> >>>>> shared storage that should prevent this.
>> >>>>>
>> >>>>> Good luck and best regards,
>> >>>>>
>> >>>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1482710
>> >>>>> [2] https://www.ovirt.org/documentation/administration_guide/#Overwriting_a_Self-Hosted_Engine
>> >>>>> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1569827#c4
>> >>>>> --
>> >>>>> Didi
>> >>
>> >> --
>> >> Didi
>> >
>> > _______________________________________________
>> > Users mailing list -- users@ovirt.org
>> > To unsubscribe send an email to users-le...@ovirt.org
>> > Privacy Statement: https://www.ovirt.org/privacy-policy.html
>> > oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
>> > List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/6QZ4OKZTHPE7LLOHNKGJC2HMMBK662GN/
>>
>> --
>> Didi
>>

--
Didi
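
P.S. The numbered recovery steps quoted above could be laid out roughly as below. This is a dry-run sketch under assumptions, not a tested procedure: it only prints commands, the VM_CONF path is a hypothetical placeholder, the backup path is the one from the log in this thread, and every flag should be checked against 'hosted-engine --help' and 'engine-backup --help' on the installed version before running anything.

```shell
#!/bin/sh
# Dry-run sketch of the quoted steps 1-7: it prints commands, it does not
# run them. Both paths are placeholders to be replaced.
VM_CONF="${VM_CONF:-/root/engine-rescue-vm.conf}"         # hypothetical edited copy of the vm conf
BACKUP="${BACKUP:-/var/backup/daily.0/engine-backup.gz}"  # archive path seen in the quoted log

plan() {
    # 1. move to global maintenance
    echo "hosted-engine --set-maintenance --mode=global"
    # 2. stop the current dead vm (if needed)
    echo "hosted-engine --vm-poweroff"
    # 3. start the vm from the edited conf (rescue iso, PXE, ...)
    echo "hosted-engine --vm-start --vm-conf=$VM_CONF"
    # 4. connect a console
    echo "hosted-engine --console"
    # 5. manual work inside the vm, nothing to script here
    echo "# inside the vm: clean the disk, install the OS and oVirt"
    # 6. restore inside the vm, letting the tool provision the databases
    echo "engine-backup --mode=restore --file=$BACKUP --log=restore.log --provision-all-databases"
    # 7. cleanly stop the vm, leave global maintenance, let HA start it
    echo "hosted-engine --vm-shutdown"
    echo "hosted-engine --set-maintenance --mode=none"
}

plan
```

Printing the plan first makes it easy to review each command on the host before typing it by hand.
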
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/ZH5PPOIGN7ALY66F3SQCC37VD7KAU4J6/