[ovirt-users] Re: Fix corrupt self-hosted engine

Yedidyah Bar David Tue, 24 Nov 2020 03:07:12 -0800

On Tue, Nov 24, 2020 at 12:38 PM Alex K <rightkickt...@gmail.com> wrote:


>
>
> On Mon, Nov 23, 2020 at 10:09 AM Yedidyah Bar David <d...@redhat.com>
> wrote:
>
>> On Mon, Nov 23, 2020 at 9:54 AM Alex K <rightkickt...@gmail.com> wrote:
>> >
>> >
>> >
>> > On Sun, Nov 22, 2020 at 8:57 AM Yedidyah Bar David <d...@redhat.com>
>> wrote:
>> >>
>> >> On Thu, Nov 19, 2020 at 9:43 PM Alex K <rightkickt...@gmail.com>
>> wrote:
>> >>>
>> >>>
>> >>>
>> >>> On Thu, Nov 19, 2020 at 5:31 PM Alex K <rightkickt...@gmail.com>
>> wrote:
>> >>>>
>> >>>> Hi Didi,
>> >>>>
>> >>>> On Thu, Nov 19, 2020 at 5:13 PM Yedidyah Bar David <d...@redhat.com>
>> wrote:
>> >>>>>
>> >>>>> On Thu, Nov 19, 2020 at 4:37 PM Alex K <rightkickt...@gmail.com>
>> wrote:
>> >>>>>>
>> >>>>>> Hi all,
>> >>>>>>
>> >>>>>> I have a corrupt self-hosted engine (with several file system
>> >>>>>> errors, postgres not able to start) and thus it does not give access to the
>> >>>>>> web UI. This happened following an unlucky split brain resolution (I am
>> >>>>>> running 2 nodes). The two hosts are also running VMs, which I would like to
>> >>>>>> keep running as they are needed.
>> >>>>>>
>> >>>>>> When trying to boot into rescue mode (using the
>> >>>>>> systemd.unit=emergency.target boot parameter) I get a cursor and nothing
>> >>>>>> else.
>> >>>>>
>> >>>>>
>> >>>>> This means that more than just the DB is corrupt...
>> >>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> I have backups of engine files with scope all (using the
>> >>>>>> engine-backup tool).
>> >>>>>> What is the best approach to try and fix the engine or redeploy?
>> >>>>>
>> >>>>>
>> >>>>> If you are careful, and know what you are doing, you can try
>> >>>>> something like the following. I am not giving many details; hopefully you
>> >>>>> can find tutorials on the net about how to use the things I suggest:
>> >>>>>
>> >>>>> 1. Move to global maintenance
>> >>>>>
>> >>>>> 2. Stop the current dead vm (if needed)
>> >>>>>
>> >>>>> 3. Find current vm conf, edit it to boot from a rescue iso image of
>> >>>>> your preference or from net/PXE etc., and start the vm with '--vm-conf'
>> >>>>> pointing to your edited file.
>> >>>>>
>> >>>>> 4. Connect a console (hosted-engine --console, or 'virsh console',
>> >>>>> or use '--add-console-password' and remote viewer, if needed)
>> >>>>>
>> >>>>> 5. Clean the disk and install the OS, oVirt, etc.
>> >>>>>
>> >>>>> 6. Copy your backup into the vm and restore with engine-backup
>> >>>>>
>> >>>>> 7. Then cleanly stop the machine, exit global maint, and let HA
>> >>>>> start it (or start it yourself with --vm-start).
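
A rough sketch of the host-side commands behind steps 1-4 and 7 in the list above. It only prints the commands so they can be reviewed and then run by hand, one at a time. The function name and the vm.conf path are placeholders I made up for illustration, and whether '--vm-poweroff' is needed at all depends on the state of the VM. Steps 5 and 6 happen inside the VM and are not covered.

```shell
# Print (do not run) the hosted-engine commands for steps 1-4 and 7.
# Assumption: $1 is the path to your own edited copy of the vm conf.
print_rescue_plan() {
    vm_conf="$1"
    cat <<EOF
hosted-engine --set-maintenance --mode=global
hosted-engine --vm-poweroff
hosted-engine --vm-start --vm-conf=$vm_conf
hosted-engine --console
hosted-engine --vm-shutdown
hosted-engine --set-maintenance --mode=none
EOF
}
```

For example, 'print_rescue_plan /root/my-vm.conf' lists the six commands in order; the path is hypothetical. Printing rather than executing is deliberate, since each step should be checked before moving on to the next.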
>> >>>>>
>> >>>>> At the time, we had a bug [1] to document this. The result is [2].
>> It does not detail how to boot/reinstall os/etc., only restore (if e.g. db
>> is dead but fs is ok).
>> >>>>> For something somewhat similar to what you want, see also [3],
>> which uses guestfish. Might be useful, depending on how badly your disk is
>> corrupted.
>> >>>>
>> >>>> I went with the guestfish approach. It fixed some fs issues, and
>> >>>> now yum etc. seem fine, apart from postgres.
>> >>>> I had previously tried to uninstall/install packages, so I ended up
>> >>>> installing them again with yum install ovirt\*setup\*.
>> >>>> Now I think I have to run engine-setup, but I get the error:
>> >>>>
>> >>>>  Failed to execute stage 'Environment setup': Cannot connect to
>> Engine database using existing credentials: engine@localhost:5432
>> >>>
>> >>> Seems that I need to have pgsql running to be able to run
>> >>> engine-backup --mode=restore. Are there any steps for manually
>> >>> preparing pgsql for ovirt, so as to attempt restoration?
>> >>
>> >>
>> >> Replying again, also to conclude this part of your episode: Generally
>> speaking, that's not needed. restore --provision-all-databases should do
>> that for you.
>> >
>> > Seems that when pgsql is down nothing can be done. You need at least
>> > pgsql up and running (a clean state will do) so as to be able to proceed
>> > with restoration.
>>
>> Do you still have logs from this? Both engine-backup's (default to
>> /var/log/ovirt-engine-backup/something if you do not pass --log) and
>> ovirt-engine-provisiondb which it runs (at
>> /var/log/ovirt-engine/setup).
>>
> I was using --provision-all-databases flag when trying to restore. I might
> retest to double check. When the pgsql was down, I was getting:
>
> 2020-11-19 22:06:35 4947: Start of engine-backup mode restore scope all file /var/backup/daily.0/engine-backup.gz
> 2020-11-19 22:06:35 4947: OUTPUT: Start of engine-backup with mode 'restore'
> 2020-11-19 22:06:35 4947: OUTPUT: scope: all
> 2020-11-19 22:06:35 4947: OUTPUT: archive file: /var/backup/daily.0/engine-backup.gz
> 2020-11-19 22:06:35 4947: OUTPUT: log file: restore.log
> 2020-11-19 22:06:35 4947: Setting scl env for rh-postgresql10
> 2020-11-19 22:06:35 4947: OUTPUT: Preparing to restore:
> 2020-11-19 22:06:35 4947: OUTPUT: - Unpacking file '/var/backup/daily.0/engine-backup.gz'
> 2020-11-19 22:06:35 4947: Opening tarball /var/backup/daily.0/engine-backup.gz to /tmp/engine-backup.63eeNqt4NH
> 2020-11-19 22:06:35 4947: Verifying hash
> 2020-11-19 22:06:35 4947: Verifying version
> 2020-11-19 22:06:35 4947: Reading config
> 2020-11-19 22:06:35 4947: OUTPUT: Restoring:
> 2020-11-19 22:06:35 4947: OUTPUT: - Files
> 2020-11-19 22:06:35 4947: Restoring files
> 2020-11-19 22:06:36 4947: Reloading configuration
>

After this point, if it was restoring any db, it should have had:

    Provisioning PostgreSQL users/databases:

So either you didn't pass any '--provision' option, or your backup did not
include any db dump.
Perhaps you run your backups with '--scope=files'?
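
To check the second possibility, something like the following might help. It is only a sketch: the function name is made up, and it assumes the backup is a tar archive that your tar can list directly (GNU tar detects the compression by itself) and that DB dumps are stored under a "db/" directory inside it. Please verify that layout against your own archive before trusting the result.

```shell
# Hypothetical helper: succeed if the archive appears to contain a DB dump.
# Assumption: dumps live under "db/" (optionally prefixed with "./") in the tarball.
has_db_dump() {
    archive="$1"
    # List the archive members and look for any file below db/.
    tar -tf "$archive" | grep -Eq '^(\./)?db/.+'
}
```

With the archive from your log, this would be used as: has_db_dump /var/backup/daily.0/engine-backup.gz && echo "db dump found". If it finds nothing, the backup was most likely taken with files only.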

( Now pushed this patch, to log what we find:
https://gerrit.ovirt.org/112338 )


> 2020-11-19 22:06:36 4947: Generating pgpass
> 2020-11-19 22:06:36 4947: Verifying connection
> 2020-11-19 22:06:36 4947: pg_cmd running: psql -w -U engine -h localhost -p 5432  engine -c select 1
> psql: FATAL:  Ident authentication failed for user "engine"
> 2020-11-19 22:06:36 4947: FATAL: Can't connect to database 'engine'. Please see '/usr/bin/engine-backup --help'.
>
>
>> Not sure what you mean by "a clean state will do". If you just install
>> PG, it is not enabled by default, so it is not "up and running".
>>
> I mean pgsql re-installed and the data store cleaned as below:
>
> rm -rf /var/opt/rh/rh-postgresql10/lib/pgsql/data/*
> /opt/rh/rh-postgresql10/root/usr/bin/postgresql-setup --initdb
> systemctl restart rh-postgresql10-postgresql.service
>
>>
>> Generally speaking:
>>
>> If you never started/inited PG (e.g. on a clean machine), restore,
>> with --provision-all-databases, does this for you. Are you sure you
>> passed this?
>>
> I am pretty sure I used that flag but might be able to repeat for
> testing.
>

Thanks.


>
>> If you did, and created DB/user with the same name it wants to restore
>> to, but left the DB empty, it will use it.
>>
>> If you populated the DB, it will fail with a suitable error message.
>>
> Confirmed. When I created the DB and users it was failing. So I cleaned
> everything, started pgsql and left the tool to do its job.
>

If you created both user and db, but left the db empty, it should have been
used as-is. Only if it has content we fail.
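
If you want to see which of these two states you are in before retrying, a rough check could look like the sketch below. The connection parameters are the ones from your log. The function names are made up, and counting tables in the "public" schema is only my approximation of "has content" here, not necessarily the exact test engine-backup applies. The query also needs a working password/pgpass for the engine user.

```shell
# Count tables in the "public" schema of the engine DB.
# -A/-t make psql print just the bare number.
count_public_tables() {
    psql -w -At -U engine -h localhost -p 5432 \
        -c "select count(*) from information_schema.tables where table_schema = 'public'" \
        engine
}

# Map a table count to the behaviour described above.
describe_db_state() {
    if [ "$1" -eq 0 ]; then
        echo "empty: restore should use the database as-is"
    else
        echo "populated: restore is expected to fail"
    fi
}
```

Usage would be: describe_db_state "$(count_public_tables)". If psql cannot connect at all (as with the Ident failure in your log), the count is empty and the check itself errors out, which already tells you the problem is with authentication rather than with the DB content.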

Best regards,


>
>> These are the states that are intended to be supported.
>>
>> Anything else might break it in other ways.
>>
>> >>
>> >>
>> >> I replied to all your interim emails in private, since you replied in
>> private.
>> >
>> > Did not notice I was replying in private :)
>>
>> NP :-)
>>
>> >>
>> >>
>> >> Thanks for the final message to the list.
>> >>
>> >> It would be nice if you send another summary of the main obstacles you
>> ran into, what worked and didn't work, and especially what ideas you can
>> think of to improve the code/doc for the next time something similar
>> happens (also to you :-) ).
>> >>
>> >> If you feel like that, and have time, it sounds like a nice
>> opportunity for a blog post :-) (I know I (almost?) never wrote any myself,
>> sorry, but I like reading them - and they are much more approachable and
>> useful, over the long run, compared to just posting to the list).
>> >
>> > Noted. Will check to put this in a blog.  Generally the missing part
>> from the docs was that one cannot proceed with the restoration if pgsql is
>> not able to start. So I had to clean re-install pgsql and initialize its
>> data store before proceeding with the restoration.
>>
>> Well, I'd definitely not want a blog post saying you must manually
>> init PG - if you indeed must, that's a bug, so I'd rather fix it
>> first.
>>
> Noted.
>
>>
>> Thanks and best regards,
>>
>> >>
>> >>
>> >> Best regards,
>> >>
>> >>>>
>> >>>>
>> >>>> So I guess I need to follow [2]. What do you think?
>> >>>>
>> >>>>>
>> >>>>> How did you run into a split brain? There is a lock on the shared
>> storage that should prevent this.
>> >>>>>
>> >>>>> Good luck and best regards,
>> >>>>>
>> >>>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1482710
>> >>>>> [2]
>> https://www.ovirt.org/documentation/administration_guide/#Overwriting_a_Self-Hosted_Engine
>> >>>>> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1569827#c4
>> >>>>> --
>> >>>>> Didi
>> >>
>> >>
>> >>
>> >> --
>> >> Didi
>> >
>> > _______________________________________________
>> > Users mailing list -- users@ovirt.org
>> > To unsubscribe send an email to users-le...@ovirt.org
>> > Privacy Statement: https://www.ovirt.org/privacy-policy.html
>> > oVirt Code of Conduct:
>> https://www.ovirt.org/community/about/community-guidelines/
>> > List Archives:
>> https://lists.ovirt.org/archives/list/users@ovirt.org/message/6QZ4OKZTHPE7LLOHNKGJC2HMMBK662GN/
>>
>>
>>
>> --
>> Didi
>>
>>

-- 
Didi

_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/ZH5PPOIGN7ALY66F3SQCC37VD7KAU4J6/
