[ovirt-users] Re: Fix corrupt self-hosted engine

2020-11-24 Thread Yedidyah Bar David
On Tue, Nov 24, 2020 at 12:38 PM Alex K  wrote:

>
>
> On Mon, Nov 23, 2020 at 10:09 AM Yedidyah Bar David 
> wrote:
>
>> On Mon, Nov 23, 2020 at 9:54 AM Alex K  wrote:
>> >
>> >
>> >
>> > On Sun, Nov 22, 2020 at 8:57 AM Yedidyah Bar David 
>> wrote:
>> >>
>> >> On Thu, Nov 19, 2020 at 9:43 PM Alex K 
>> wrote:
>> >>>
>> >>>
>> >>>
>> >>> On Thu, Nov 19, 2020 at 5:31 PM Alex K 
>> wrote:
>> 
>>  Hi Didi,
>> 
>>  On Thu, Nov 19, 2020 at 5:13 PM Yedidyah Bar David 
>> wrote:
>> >
>> > On Thu, Nov 19, 2020 at 4:37 PM Alex K 
>> wrote:
>> >>
>> >> Hi all,
>> >>
>> >> I have a corrupt self-hosted engine (with several file system
>> errors, postgres not able to start) and thus it does not give access to the
>> web UI. This happened following an unlucky split brain resolution (I am
>> running 2 nodes). The two hosts are running VMs also which I would like to
>> keep running as they are needed.
>> >>
>> >> When trying to boot into rescue mode (using
>> systemd.unit=emergency.target boot parameter) I get a cursor and nothing
>> else.
>> >
>> >
>> > This means that more than just the DB is corrupt...
>> >
>> >>
>> >>
>> >> I have backups of engine files with scope all (using the
>> engine-backup tool).
>> >> What is the best approach to try and fix the engine, or to redeploy?
>> >
>> >
>> > If you are careful, and know what you are doing, you can try
>> something like the following. I am not giving many details, hopefully you
>> can find on the net tutorials about how to use the things I suggest:
>> >
>> > 1. Move to global maintenance
>> >
>> > 2. Stop the current dead vm (if needed)
>> >
>> > 3. Find current vm conf, edit it to boot from a rescue iso image of
>> your preference or from net/PXE etc., and start the vm with '--vm-conf'
>> pointing to your edited file.
>> >
>> > 4. Connect a console (hosted-engine --console, or 'virsh console',
>> or use '--add-console-password' and remote viewer, if needed)
>> >
>> > 5. Clean the disk and install the OS, oVirt, etc.
>> >
>> > 6. Copy your backup into the vm and restore with engine-backup
>> >
>> > 7. Then cleanly stop the machine, exit global maint, and let HA
>> start it (or start it yourself with --vm-start).
>> >
>> > At the time, we had a bug [1] to document this. The result is [2].
>> It does not detail how to boot/reinstall os/etc., only restore (if e.g. db
>> is dead but fs is ok).
>> > For something somewhat similar to what you want, see also [3],
>> which uses guestfish. Might be useful, depending on how badly your disk is
>> corrupted.
>> 
>>  I went with the guestfish approach. It has fixed some fs issues and
>> now the yum etc seem fine apart from postgres.
>>  I had tried previously to uninstall/install packages so I ended
>> installing them again with yum install ovirt\*setup\*.
>>  Now I think I have to run engine-setup but I get the error:
>> 
>>   Failed to execute stage 'Environment setup': Cannot connect to
>> Engine database using existing credentials: engine@localhost:5432
>> >>>
>> >>> Seems that I need to have psql running to be able to run
>> engine-backup --mode=restore. Are there any steps how one could manually
>> prepare pgsql for ovirt so as to attempt restoration?
>> >>
>> >>
>> >> Replying again, also to conclude this part of your episode: Generally
>> speaking, that's not needed. restore --provision-all-databases should do
>> that for you.
>> >
>> > Seems that when pgsql is down nothing can be done. You need at least
>> pgsql up and running (a clean state will do) so as to be able to proceed
>> with restoration.
>>
>> Do you still have logs from this? Both engine-backup's (default to
>> /var/log/ovirt-engine-backup/something if you do not pass --log) and
>> ovirt-engine-provisiondb which it runs (at
>> /var/log/ovirt-engine/setup).
>>
> I was using --provision-all-databases flag when trying to restore. I might
> retest to double check. When the pgsql was down, I was getting:
>
> 2020-11-19 22:06:35 4947: Start of engine-backup mode restore scope all file /var/backup/daily.0/engine-backup.gz
> 2020-11-19 22:06:35 4947: OUTPUT: Start of engine-backup with mode 'restore'
> 2020-11-19 22:06:35 4947: OUTPUT: scope: all
> 2020-11-19 22:06:35 4947: OUTPUT: archive file: /var/backup/daily.0/engine-backup.gz
> 2020-11-19 22:06:35 4947: OUTPUT: log file: restore.log
> 2020-11-19 22:06:35 4947: Setting scl env for rh-postgresql10
> 2020-11-19 22:06:35 4947: OUTPUT: Preparing to restore:
> 2020-11-19 22:06:35 4947: OUTPUT: - Unpacking file '/var/backup/daily.0/engine-backup.gz'
> 2020-11-19 22:06:35 4947: Opening tarball /var/backup/daily.0/engine-backup.gz to /tmp/engine-backup.63eeNqt4NH
> 2020-11-19 22:06:35 4947: Verifying hash
> 2020-11-19 22:06:35 4947: Verifying version
> 2020-11-19 22:06:35 4947: Reading config
> 2020-11-19 22:06:35 4947: OUTPUT: 

[ovirt-users] Re: Fix corrupt self-hosted engine

2020-11-24 Thread Alex K
On Mon, Nov 23, 2020 at 10:09 AM Yedidyah Bar David  wrote:

> On Mon, Nov 23, 2020 at 9:54 AM Alex K  wrote:
> >
> >
> >
> > On Sun, Nov 22, 2020 at 8:57 AM Yedidyah Bar David 
> wrote:
> >>
> >> On Thu, Nov 19, 2020 at 9:43 PM Alex K  wrote:
> >>>
> >>>
> >>>
> >>> On Thu, Nov 19, 2020 at 5:31 PM Alex K 
> wrote:
> 
>  Hi Didi,
> 
>  On Thu, Nov 19, 2020 at 5:13 PM Yedidyah Bar David 
> wrote:
> >
> > On Thu, Nov 19, 2020 at 4:37 PM Alex K 
> wrote:
> >>
> >> Hi all,
> >>
> >> I have a corrupt self-hosted engine (with several file system
> errors, postgres not able to start) and thus it does not give access to the
> web UI. This happened following an unlucky split brain resolution (I am
> running 2 nodes). The two hosts are running VMs also which I would like to
> keep running as they are needed.
> >>
> >> When trying to boot into rescue mode (using
> systemd.unit=emergency.target boot parameter) I get a cursor and nothing
> else.
> >
> >
> > This means that more than just the DB is corrupt...
> >
> >>
> >>
> >> I have backups of engine files with scope all (using the
> engine-backup tool).
> >> What is the best approach to try and fix the engine, or to redeploy?
> >
> >
> > If you are careful, and know what you are doing, you can try
> something like the following. I am not giving many details, hopefully you
> can find on the net tutorials about how to use the things I suggest:
> >
> > 1. Move to global maintenance
> >
> > 2. Stop the current dead vm (if needed)
> >
> > 3. Find current vm conf, edit it to boot from a rescue iso image of
> your preference or from net/PXE etc., and start the vm with '--vm-conf'
> pointing to your edited file.
> >
> > 4. Connect a console (hosted-engine --console, or 'virsh console',
> or use '--add-console-password' and remote viewer, if needed)
> >
> > 5. Clean the disk and install the OS, oVirt, etc.
> >
> > 6. Copy your backup into the vm and restore with engine-backup
> >
> > 7. Then cleanly stop the machine, exit global maint, and let HA
> start it (or start it yourself with --vm-start).
> >
> > At the time, we had a bug [1] to document this. The result is [2].
> It does not detail how to boot/reinstall os/etc., only restore (if e.g. db
> is dead but fs is ok).
> > For something somewhat similar to what you want, see also [3], which
> uses guestfish. Might be useful, depending on how badly your disk is
> corrupted.
> 
>  I went with the guestfish approach. It has fixed some fs issues and
> now the yum etc seem fine apart from postgres.
>  I had tried previously to uninstall/install packages so I ended
> installing them again with yum install ovirt\*setup\*.
>  Now I think I have to run engine-setup but I get the error:
> 
>   Failed to execute stage 'Environment setup': Cannot connect to
> Engine database using existing credentials: engine@localhost:5432
> >>>
> >>> Seems that I need to have psql running to be able to run engine-backup
> --mode=restore. Are there any steps how one could manually prepare pgsql
> for ovirt so as to attempt restoration?
> >>
> >>
> >> Replying again, also to conclude this part of your episode: Generally
> speaking, that's not needed. restore --provision-all-databases should do
> that for you.
> >
> > Seems that when pgsql is down nothing can be done. You need at least
> pgsql up and running (a clean state will do) so as to be able to proceed
> with restoration.
>
> Do you still have logs from this? Both engine-backup's (default to
> /var/log/ovirt-engine-backup/something if you do not pass --log) and
> ovirt-engine-provisiondb which it runs (at
> /var/log/ovirt-engine/setup).
>
I was using --provision-all-databases flag when trying to restore. I might
retest to double check. When the pgsql was down, I was getting:

2020-11-19 22:06:35 4947: Start of engine-backup mode restore scope all file /var/backup/daily.0/engine-backup.gz
2020-11-19 22:06:35 4947: OUTPUT: Start of engine-backup with mode 'restore'
2020-11-19 22:06:35 4947: OUTPUT: scope: all
2020-11-19 22:06:35 4947: OUTPUT: archive file: /var/backup/daily.0/engine-backup.gz
2020-11-19 22:06:35 4947: OUTPUT: log file: restore.log
2020-11-19 22:06:35 4947: Setting scl env for rh-postgresql10
2020-11-19 22:06:35 4947: OUTPUT: Preparing to restore:
2020-11-19 22:06:35 4947: OUTPUT: - Unpacking file '/var/backup/daily.0/engine-backup.gz'
2020-11-19 22:06:35 4947: Opening tarball /var/backup/daily.0/engine-backup.gz to /tmp/engine-backup.63eeNqt4NH
2020-11-19 22:06:35 4947: Verifying hash
2020-11-19 22:06:35 4947: Verifying version
2020-11-19 22:06:35 4947: Reading config
2020-11-19 22:06:35 4947: OUTPUT: Restoring:
2020-11-19 22:06:35 4947: OUTPUT: - Files
2020-11-19 22:06:35 4947: Restoring files
2020-11-19 22:06:36 4947: Reloading configuration
2020-11-19 22:06:36 4947: Generating pgpass
2020-11-19 

[ovirt-users] Re: Fix corrupt self-hosted engine

2020-11-23 Thread Yedidyah Bar David
On Mon, Nov 23, 2020 at 9:54 AM Alex K  wrote:
>
>
>
> On Sun, Nov 22, 2020 at 8:57 AM Yedidyah Bar David  wrote:
>>
>> On Thu, Nov 19, 2020 at 9:43 PM Alex K  wrote:
>>>
>>>
>>>
>>> On Thu, Nov 19, 2020 at 5:31 PM Alex K  wrote:

 Hi Didi,

 On Thu, Nov 19, 2020 at 5:13 PM Yedidyah Bar David  wrote:
>
> On Thu, Nov 19, 2020 at 4:37 PM Alex K  wrote:
>>
>> Hi all,
>>
>> I have a corrupt self-hosted engine (with several file system errors, 
>> postgres not able to start) and thus it does not give access to the web 
>> UI. This happened following an unlucky split brain resolution (I am 
>> running 2 nodes). The two hosts are running VMs also which I would like 
>> to keep running as they are needed.
>>
>> When trying to boot into rescue mode (using 
>> systemd.unit=emergency.target boot parameter) I get a cursor and nothing 
>> else.
>
>
> This means that more than just the DB is corrupt...
>
>>
>>
>> I have backups of engine files with scope all (using the engine-backup 
>> tool).
>> What is the best approach to try and fix the engine, or to redeploy?
>
>
> If you are careful, and know what you are doing, you can try something 
> like the following. I am not giving many details, hopefully you can find 
> on the net tutorials about how to use the things I suggest:
>
> 1. Move to global maintenance
>
> 2. Stop the current dead vm (if needed)
>
> 3. Find current vm conf, edit it to boot from a rescue iso image of your 
> preference or from net/PXE etc., and start the vm with '--vm-conf' 
> pointing to your edited file.
>
> 4. Connect a console (hosted-engine --console, or 'virsh console', or use 
> '--add-console-password' and remote viewer, if needed)
>
> 5. Clean the disk and install the OS, oVirt, etc.
>
> 6. Copy your backup into the vm and restore with engine-backup
>
> 7. Then cleanly stop the machine, exit global maint, and let HA start it 
> (or start it yourself with --vm-start).
>
> At the time, we had a bug [1] to document this. The result is [2]. It 
> does not detail how to boot/reinstall os/etc., only restore (if e.g. db 
> is dead but fs is ok).
> For something somewhat similar to what you want, see also [3], which uses 
> guestfish. Might be useful, depending on how badly your disk is corrupted.

 I went with the guestfish approach. It has fixed some fs issues and now 
 the yum etc seem fine apart from postgres.
 I had tried previously to uninstall/install packages so I ended installing 
 them again with yum install ovirt\*setup\*.
 Now I think I have to run engine-setup but I get the error:

  Failed to execute stage 'Environment setup': Cannot connect to Engine 
 database using existing credentials: engine@localhost:5432
>>>
>>> Seems that I need to have psql running to be able to run engine-backup 
>>> --mode=restore. Are there any steps how one could manually prepare pgsql 
>>> for ovirt so as to attempt restoration?
>>
>>
>> Replying again, also to conclude this part of your episode: Generally 
>> speaking, that's not needed. restore --provision-all-databases should do 
>> that for you.
>
> Seems that when pgsql is down nothing can be done. You need at least pgsql up 
> and running (a clean state will do) so as to be able to proceed with 
> restoration.

Do you still have logs from this? Both engine-backup's (default to
/var/log/ovirt-engine-backup/something if you do not pass --log) and
ovirt-engine-provisiondb which it runs (at
/var/log/ovirt-engine/setup).
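
To pick out the most recent file in each of these two directories, something like the following may help. It is only a sketch: the helper name `newest_file` is made up for illustration, and the only paths it knows about are the two default locations mentioned above.

```shell
#!/bin/sh
# Print the name of the most recently modified entry in a directory.
# Prints nothing if the directory is missing or empty.
newest_file() {
    ls -t "$1" 2>/dev/null | head -n 1
}

# Default log locations mentioned in this thread.
for d in /var/log/ovirt-engine-backup /var/log/ovirt-engine/setup; do
    f=$(newest_file "$d")
    if [ -n "$f" ]; then
        echo "$d/$f"
    else
        echo "no log found under $d"
    fi
done
```

If engine-backup was run with --log=restore.log, as in the commands posted in this thread, its log is in that file instead of the default directory.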

Not sure what you mean by "a clean state will do". If you just install
PG, it is not enabled by default, so it is not "up and running".

Generally speaking:

If you never started/inited PG (e.g. on a clean machine), restore,
with --provision-all-databases, does this for you. Are you sure you
passed this?

If you did, and created DB/user with the same name it wants to restore
to, but left the DB empty, it will use it.

If you populated the DB, it will fail with a suitable error message.

These are the states that are intended to be supported.

Anything else might break it in other ways.
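
A rough way to tell the first state (PG never inited) from the others is to check whether the data directory holds a PG_VERSION file, which initdb writes. The sketch below assumes the rh-postgresql10 data directory path that appears elsewhere in this thread; the function name `pg_cluster_state` is made up for illustration. It does not tell an empty DB from a populated one, which would need a query against a running server.

```shell
#!/bin/sh
# Data directory as posted in this thread (rh-postgresql10 SCL); adjust if yours differs.
PGDATA_DEFAULT=/var/opt/rh/rh-postgresql10/lib/pgsql/data

# Print "initialized" if the directory contains PG_VERSION (initdb has run),
# "uninitialized" otherwise (never inited, or the directory was wiped).
pg_cluster_state() {
    dir=${1:-$PGDATA_DEFAULT}
    if [ -f "$dir/PG_VERSION" ]; then
        echo initialized
    else
        echo uninitialized
    fi
}

pg_cluster_state
```

Note that an initialized cluster whose server cannot start is none of the states listed above, so it falls under "anything else".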

>>
>>
>> I replied to all your interim emails in private, since you replied in 
>> private.
>
> Did not notice I was replying in private :)

NP :-)

>>
>>
>> Thanks for the final message to the list.
>>
>> It would be nice if you send another summary of the main obstacles you ran 
>> into, what worked and didn't work, and especially what ideas you can think 
>> of to improve the code/doc for the next time something similar happens (also 
>> to you :-) ).
>>
>> If you feel like that, and have time, it sounds like a nice opportunity for 
>> a blog post :-) (I know I (almost?) never wrote any myself, sorry, but I 
>> like reading them - and they are much more approachable and useful, over the 
>> long 

[ovirt-users] Re: Fix corrupt self-hosted engine

2020-11-22 Thread Alex K
On Sun, Nov 22, 2020 at 8:57 AM Yedidyah Bar David  wrote:

> On Thu, Nov 19, 2020 at 9:43 PM Alex K  wrote:
>
>>
>>
>> On Thu, Nov 19, 2020 at 5:31 PM Alex K  wrote:
>>
>>> Hi Didi,
>>>
>>> On Thu, Nov 19, 2020 at 5:13 PM Yedidyah Bar David 
>>> wrote:
>>>
>>>> On Thu, Nov 19, 2020 at 4:37 PM Alex K wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have a corrupt self-hosted engine (with several file system errors,
>>>>> postgres not able to start) and thus it does not give access to the web
>>>>> UI.
>>>>> This happened following an unlucky split brain resolution (I am running 2
>>>>> nodes). The two hosts are running VMs also which I would like to keep
>>>>> running as they are needed.
>>>>>
>>>>> When trying to boot into rescue mode (using
>>>>> systemd.unit=emergency.target boot parameter) I get a cursor and nothing
>>>>> else.
>>>>>
>>>>
>>>> This means that more than just the DB is corrupt...
>>>>
>>>>
>>>>>
>>>>> I have backups of engine files with scope all (using the engine-backup
>>>>> tool).
>>>>> What is the best approach to try and fix the engine, or to redeploy?
>>>>>
>>>>
>>>> If you are careful, and know what you are doing, you can try something
>>>> like the following. I am not giving many details, hopefully you can find on
>>>> the net tutorials about how to use the things I suggest:
>>>>
>>>> 1. Move to global maintenance
>>>>
>>>> 2. Stop the current dead vm (if needed)
>>>>
>>>> 3. Find current vm conf, edit it to boot from a rescue iso image of
>>>> your preference or from net/PXE etc., and start the vm with '--vm-conf'
>>>> pointing to your edited file.
>>>>
>>>> 4. Connect a console (hosted-engine --console, or 'virsh console', or
>>>> use '--add-console-password' and remote viewer, if needed)
>>>>
>>>> 5. Clean the disk and install the OS, oVirt, etc.
>>>>
>>>> 6. Copy your backup into the vm and restore with engine-backup
>>>>
>>>> 7. Then cleanly stop the machine, exit global maint, and let HA start
>>>> it (or start it yourself with --vm-start).
>>>>
>>>> At the time, we had a bug [1] to document this. The result is [2]. It
>>>> does not detail how to boot/reinstall os/etc., only restore (if e.g. db is
>>>> dead but fs is ok).
>>>> For something somewhat similar to what you want, see also [3], which
>>>> uses guestfish. Might be useful, depending on how badly your disk is
>>>> corrupted.

>>> I went with the guestfish approach. It has fixed some fs issues and now
>>> the yum etc seem fine apart from postgres.
>>> I had tried previously to uninstall/install packages so I ended
>>> installing them again with yum install ovirt\*setup\*.
>>> Now I think I have to run engine-setup but I get the error:
>>>
>>>  Failed to execute stage 'Environment setup': Cannot connect to Engine
>>> database using existing credentials: engine@localhost:5432
>>>
>> Seems that I need to have psql running to be able to run engine-backup
>> --mode=restore. Are there any steps how one could manually prepare pgsql
>> for ovirt so as to attempt restoration?
>>
>
> Replying again, also to conclude this part of your episode: Generally
> speaking, that's not needed. restore --provision-all-databases should do
> that for you.
>
Seems that when pgsql is down nothing can be done. You need at least pgsql
up and running (a clean state will do) so as to be able to proceed with
restoration.

>
> I replied to all your interim emails in private, since you replied in
> private.
>
Did not notice I was replying in private :)

>
> Thanks for the final message to the list.
>
> It would be nice if you send another summary of the main obstacles you ran
> into, what worked and didn't work, and especially what ideas you can think
> of to improve the code/doc for the next time something similar happens
> (also to you :-) ).
>
> If you feel like that, and have time, it sounds like a nice opportunity
> for a blog post :-) (I know I (almost?) never wrote any myself, sorry, but
> I like reading them - and they are much more approachable and useful, over
> the long run, compared to just posting to the list).
>
Noted. I will look into putting this in a blog post. Generally, the part
missing from the docs was that one cannot proceed with the restoration if
pgsql is not able to start. So I had to do a clean reinstall of pgsql and
initialize its data directory before proceeding with the restoration.

>
> Best regards,
>
>
>>
>>> So I guess I need to follow [2]. What do you think?
>>>
>>>
>>>> How did you run into a split brain? There is a lock on the shared
>>>> storage that should prevent this.
>>>>
>>>> Good luck and best regards,
>>>>
>>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1482710
>>>> [2]
>>>> https://www.ovirt.org/documentation/administration_guide/#Overwriting_a_Self-Hosted_Engine
>>>> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1569827#c4
>>>> --
>>>> Didi

>>>
>
> --
> Didi
>
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: 

[ovirt-users] Re: Fix corrupt self-hosted engine

2020-11-21 Thread Yedidyah Bar David
On Thu, Nov 19, 2020 at 11:33 PM Alex K  wrote:

> For the record,
>
> After having fixed the major fs issues with guestfish and since the DB was
> not starting up, I removed everything from DB data dir and recreated it as
> below:
>
> rm -rf /var/opt/rh/rh-postgresql10/lib/pgsql/data/*
> /opt/rh/rh-postgresql10/root/usr/bin/postgresql-setup --initdb
> systemctl restart rh-postgresql10-postgresql.service
>

Generally speaking, this should not be needed. --provision-all-databases
should do this for you.


>
> Then proceeded with the restoration, where I requested to provision all
> missing databases:
> engine-backup --mode=restore --file=engine-backup.gz \
> --provision-all-databases \
> --log=restore.log --restore-permissions
>
> Following this, ran engine-setup, as instructed from the restore
> operation.
> Gained engine web access and saw the same running VMs were shown as up
> without issues.
> I only observed one VM not able to start due to illegal volume, but that's
> another story.
>

Glad to hear that, thanks for the report!

Best regards,


>
>
> On Thu, Nov 19, 2020 at 9:42 PM Alex K  wrote:
>
>>
>>
>> On Thu, Nov 19, 2020 at 5:31 PM Alex K  wrote:
>>
>>> Hi Didi,
>>>
>>> On Thu, Nov 19, 2020 at 5:13 PM Yedidyah Bar David 
>>> wrote:
>>>
>>>> On Thu, Nov 19, 2020 at 4:37 PM Alex K wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have a corrupt self-hosted engine (with several file system errors,
>>>>> postgres not able to start) and thus it does not give access to the web
>>>>> UI.
>>>>> This happened following an unlucky split brain resolution (I am running 2
>>>>> nodes). The two hosts are running VMs also which I would like to keep
>>>>> running as they are needed.
>>>>>
>>>>> When trying to boot into rescue mode (using
>>>>> systemd.unit=emergency.target boot parameter) I get a cursor and nothing
>>>>> else.
>>>>>
>>>>
>>>> This means that more than just the DB is corrupt...
>>>>
>>>>
>>>>>
>>>>> I have backups of engine files with scope all (using the engine-backup
>>>>> tool).
>>>>> What is the best approach to try and fix the engine, or to redeploy?
>>>>>
>>>>
>>>> If you are careful, and know what you are doing, you can try something
>>>> like the following. I am not giving many details, hopefully you can find on
>>>> the net tutorials about how to use the things I suggest:
>>>>
>>>> 1. Move to global maintenance
>>>>
>>>> 2. Stop the current dead vm (if needed)
>>>>
>>>> 3. Find current vm conf, edit it to boot from a rescue iso image of
>>>> your preference or from net/PXE etc., and start the vm with '--vm-conf'
>>>> pointing to your edited file.
>>>>
>>>> 4. Connect a console (hosted-engine --console, or 'virsh console', or
>>>> use '--add-console-password' and remote viewer, if needed)
>>>>
>>>> 5. Clean the disk and install the OS, oVirt, etc.
>>>>
>>>> 6. Copy your backup into the vm and restore with engine-backup
>>>>
>>>> 7. Then cleanly stop the machine, exit global maint, and let HA start
>>>> it (or start it yourself with --vm-start).
>>>>
>>>> At the time, we had a bug [1] to document this. The result is [2]. It
>>>> does not detail how to boot/reinstall os/etc., only restore (if e.g. db is
>>>> dead but fs is ok).
>>>> For something somewhat similar to what you want, see also [3], which
>>>> uses guestfish. Might be useful, depending on how badly your disk is
>>>> corrupted.

>>> I went with the guestfish approach. It has fixed some fs issues and now
>>> the yum etc seem fine apart from postgres.
>>> I had tried previously to uninstall/install packages so I ended
>>> installing them again with yum install ovirt\*setup\*.
>>> Now I think I have to run engine-setup but I get the error:
>>>
>>>  Failed to execute stage 'Environment setup': Cannot connect to Engine
>>> database using existing credentials: engine@localhost:5432
>>>
>> Seems that I need to have psql running to be able to run engine-backup
>> --mode=restore. Are there any steps how one could manually prepare pgsql
>> for ovirt so as to attempt restoration?
>>
>>>
>>> So I guess I need to follow [2]. What do you think?
>>>
>>>
>>>> How did you run into a split brain? There is a lock on the shared
>>>> storage that should prevent this.
>>>>
>>>> Good luck and best regards,
>>>>
>>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1482710
>>>> [2]
>>>> https://www.ovirt.org/documentation/administration_guide/#Overwriting_a_Self-Hosted_Engine
>>>> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1569827#c4
>>>> --
>>>> Didi

>>> ___
> Users mailing list -- users@ovirt.org
> To unsubscribe send an email to users-le...@ovirt.org
> Privacy Statement: https://www.ovirt.org/privacy-policy.html
> oVirt Code of Conduct:
> https://www.ovirt.org/community/about/community-guidelines/
> List Archives:
> https://lists.ovirt.org/archives/list/users@ovirt.org/message/SU6V565Y5GAZ67FF5MUDGFLEJ2L2LZV7/
>


-- 
Didi

[ovirt-users] Re: Fix corrupt self-hosted engine

2020-11-19 Thread Alex K
For the record,

After having fixed the major fs issues with guestfish and since the DB was
not starting up, I removed everything from DB data dir and recreated it as
below:

rm -rf /var/opt/rh/rh-postgresql10/lib/pgsql/data/*
/opt/rh/rh-postgresql10/root/usr/bin/postgresql-setup --initdb
systemctl restart rh-postgresql10-postgresql.service

Then proceeded with the restoration, where I requested to provision all
missing databases:
engine-backup --mode=restore --file=engine-backup.gz \
  --provision-all-databases \
  --log=restore.log --restore-permissions

Following this, I ran engine-setup, as instructed by the restore operation.
I regained access to the engine web UI, and the VMs that had been running
were shown as up without issues.
I only observed one VM that was not able to start due to an illegal volume,
but that's another story.
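
For anyone wanting to repeat the sequence above, here it is collected into one script. This is a sketch rather than a tested procedure: it only prints the commands unless DRY_RUN=0 is set, and the `run` helper and the `reinit_and_restore` function name are made up for illustration. The paths, service name and engine-backup options are the ones from the commands above. Wiping the data directory destroys whatever is left of the old database, so only do it when a good backup exists.

```shell
#!/bin/sh
# Dry-run helper: print the command unless DRY_RUN=0 is set explicitly.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# Paths and service name as used in this thread (rh-postgresql10 SCL).
PGDATA=/var/opt/rh/rh-postgresql10/lib/pgsql/data
PGSVC=rh-postgresql10-postgresql.service

# $1: engine-backup archive to restore from
reinit_and_restore() {
    backup=$1
    # ${PGDATA:?} aborts instead of expanding to "/*" if the variable is empty.
    run rm -rf "${PGDATA:?}"/*
    run /opt/rh/rh-postgresql10/root/usr/bin/postgresql-setup --initdb
    run systemctl restart "$PGSVC"
    run engine-backup --mode=restore --file="$backup" \
        --provision-all-databases \
        --log=restore.log --restore-permissions
    run engine-setup
}

reinit_and_restore engine-backup.gz
```

Run as is, it prints the five commands in order so they can be reviewed first.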


On Thu, Nov 19, 2020 at 9:42 PM Alex K  wrote:

>
>
> On Thu, Nov 19, 2020 at 5:31 PM Alex K  wrote:
>
>> Hi Didi,
>>
>> On Thu, Nov 19, 2020 at 5:13 PM Yedidyah Bar David 
>> wrote:
>>
>>> On Thu, Nov 19, 2020 at 4:37 PM Alex K  wrote:
>>>
>>>> Hi all,
>>>>
>>>> I have a corrupt self-hosted engine (with several file system errors,
>>>> postgres not able to start) and thus it does not give access to the web UI.
>>>> This happened following an unlucky split brain resolution (I am running 2
>>>> nodes). The two hosts are running VMs also which I would like to keep
>>>> running as they are needed.
>>>>
>>>> When trying to boot into rescue mode (using
>>>> systemd.unit=emergency.target boot parameter) I get a cursor and nothing
>>>> else.
>>>>
>>>
>>> This means that more than just the DB is corrupt...
>>>
>>>
>>>>
>>>> I have backups of engine files with scope all (using the engine-backup
>>>> tool).
>>>> What is the best approach to try and fix the engine, or to redeploy?

>>>
>>> If you are careful, and know what you are doing, you can try something
>>> like the following. I am not giving many details, hopefully you can find on
>>> the net tutorials about how to use the things I suggest:
>>>
>>> 1. Move to global maintenance
>>>
>>> 2. Stop the current dead vm (if needed)
>>>
>>> 3. Find current vm conf, edit it to boot from a rescue iso image of your
>>> preference or from net/PXE etc., and start the vm with '--vm-conf' pointing
>>> to your edited file.
>>>
>>> 4. Connect a console (hosted-engine --console, or 'virsh console', or
>>> use '--add-console-password' and remote viewer, if needed)
>>>
>>> 5. Clean the disk and install the OS, oVirt, etc.
>>>
>>> 6. Copy your backup into the vm and restore with engine-backup
>>>
>>> 7. Then cleanly stop the machine, exit global maint, and let HA start it
>>> (or start it yourself with --vm-start).
>>>
>>> At the time, we had a bug [1] to document this. The result is [2]. It
>>> does not detail how to boot/reinstall os/etc., only restore (if e.g. db is
>>> dead but fs is ok).
>>> For something somewhat similar to what you want, see also [3], which
>>> uses guestfish. Might be useful, depending on how badly your disk is
>>> corrupted.
>>>
>> I went with the guestfish approach. It has fixed some fs issues and now
>> the yum etc seem fine apart from postgres.
>> I had tried previously to uninstall/install packages so I ended
>> installing them again with yum install ovirt\*setup\*.
>> Now I think I have to run engine-setup but I get the error:
>>
>>  Failed to execute stage 'Environment setup': Cannot connect to Engine
>> database using existing credentials: engine@localhost:5432
>>
> Seems that I need to have psql running to be able to run engine-backup
> --mode=restore. Are there any steps how one could manually prepare pgsql
> for ovirt so as to attempt restoration?
>
>>
>> So I guess I need to follow [2]. What do you think?
>>
>>
>>> How did you run into a split brain? There is a lock on the shared
>>> storage that should prevent this.
>>>
>>> Good luck and best regards,
>>>
>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1482710
>>> [2]
>>> https://www.ovirt.org/documentation/administration_guide/#Overwriting_a_Self-Hosted_Engine
>>> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1569827#c4
>>> --
>>> Didi
>>>
>>


[ovirt-users] Re: Fix corrupt self-hosted engine

2020-11-19 Thread Yedidyah Bar David
On Thu, Nov 19, 2020 at 5:12 PM Yedidyah Bar David  wrote:

> On Thu, Nov 19, 2020 at 4:37 PM Alex K  wrote:
>
>> Hi all,
>>
>> I have a corrupt self-hosted engine (with several file system errors,
>> postgres not able to start) and thus it does not give access to the web UI.
>> This happened following an unlucky split brain resolution (I am running 2
>> nodes). The two hosts are running VMs also which I would like to keep
>> running as they are needed.
>>
>> When trying to boot into rescue mode (using systemd.unit=emergency.target
>> boot parameter) I get a cursor and nothing else.
>>
>
> This means that more than just the DB is corrupt...
>
>
>>
>> I have backups of engine files with scope all (using the engine-backup
>> tool).
>> What is the best approach to try and fix the engine, or to redeploy?
>>
>
> If you are careful, and know what you are doing, you can try something
> like the following. I am not giving many details, hopefully you can find on
> the net tutorials about how to use the things I suggest:
>
> 1. Move to global maintenance
>
> 2. Stop the current dead vm (if needed)
>
> 3. Find current vm conf, edit it to boot from a rescue iso image of your
> preference or from net/PXE etc., and start the vm with '--vm-conf' pointing
> to your edited file.
>
> 4. Connect a console (hosted-engine --console, or 'virsh console', or use
> '--add-console-password' and remote viewer, if needed)
>
> 5. Clean the disk and install the OS, oVirt, etc.
>
> 6. Copy your backup into the vm and restore with engine-backup
>
> 7. Then cleanly stop the machine, exit global maint, and let HA start it
> (or start it yourself with --vm-start).
>
> At the time, we had a bug [1] to document this. The result is [2]. It does
> not detail how to boot/reinstall os/etc., only restore (if e.g. db is dead
> but fs is ok).
> For something somewhat similar to what you want, see also [3], which uses
> guestfish. Might be useful, depending on how badly your disk is corrupted.
>
> How did you run into a split brain? There is a lock on the shared storage
> that should prevent this.
>

Also, to clarify:

The "official" answer is to deploy a new hosted-engine, on new storage,
with --restore-from-file. This IMO does not let you keep your VMs up, at
least not all of them, definitely if you don't have another host to restore
on.

Keeping the VMs up is risky if you have HA VMs, or if you
started/stopped/migrated VMs after you took your backup.
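
For reference, the "official" route mentioned above comes down to one command on a host. The sketch below only prints it unless DRY_RUN=0 is set; the `run` helper, the function name and the backup path are placeholders. The real deployment is interactive and asks for the new storage, among other things.

```shell
#!/bin/sh
# Dry-run helper: print the command unless DRY_RUN=0 is set explicitly.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# $1: engine-backup archive taken from the old engine (placeholder path below)
redeploy_from_backup() {
    run hosted-engine --deploy --restore-from-file="$1"
}

redeploy_from_backup /root/engine-backup.gz
```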

Best regards,
-- 
Didi
___
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/ZCCYFBSZVUY7YJ7L5Q5U3SG4CI3CPHNN/


[ovirt-users] Re: Fix corrupt self-hosted engine

2020-11-19 Thread Yedidyah Bar David
On Thu, Nov 19, 2020 at 4:37 PM Alex K  wrote:

> Hi all,
>
> I have a corrupt self-hosted engine (with several file system errors,
> postgres not able to start) and thus it does not give access to the web UI.
> This happened following an unlucky split brain resolution (I am running 2
> nodes). The two hosts are running VMs also which I would like to keep
> running as they are needed.
>
> When trying to boot into rescue mode (using systemd.unit=emergency.target
> boot parameter) I get a cursor and nothing else.
>

This means that more than just the DB is corrupt...


>
> I have backups of engine files with scope all (using the engine-backup
> tool).
> What is the best approach to try and fix the engine, or to redeploy?
>

If you are careful, and know what you are doing, you can try something like
the following. I am not giving many details, hopefully you can find on the
net tutorials about how to use the things I suggest:

1. Move to global maintenance

2. Stop the current dead vm (if needed)

3. Find current vm conf, edit it to boot from a rescue iso image of your
preference or from net/PXE etc., and start the vm with '--vm-conf' pointing
to your edited file.

4. Connect a console (hosted-engine --console, or 'virsh console', or use
'--add-console-password' and remote viewer, if needed)

5. Clean the disk and install the OS, oVirt, etc.

6. Copy your backup into the vm and restore with engine-backup

7. Then cleanly stop the machine, exit global maint, and let HA start it
(or start it yourself with --vm-start).
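
The host-side part of the steps above could look roughly like this. It is a sketch under assumptions, not a tested procedure: it only prints the commands unless DRY_RUN=0 is set, the `run` helper and the `rescue_plan` function name are made up, and /root/vm-rescue.conf is a placeholder for your edited copy of the vm conf. Finding the current vm conf and editing it to boot from a rescue image is not shown.

```shell
#!/bin/sh
# Dry-run helper: print the command unless DRY_RUN=0 is set explicitly.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# $1: edited copy of the engine VM's vm conf (placeholder path below),
#     changed by hand beforehand to boot from a rescue ISO or PXE.
rescue_plan() {
    edited_conf=$1
    run hosted-engine --set-maintenance --mode=global        # step 1
    run hosted-engine --vm-poweroff                          # step 2, if the VM is still up
    run hosted-engine --vm-start --vm-conf="$edited_conf"    # step 3
    run hosted-engine --console                              # step 4
    # Steps 5 and 6 happen inside the VM: reinstall, then restore with engine-backup.
    # Step 7: after shutting the VM down cleanly from inside, leave global maintenance.
    run hosted-engine --set-maintenance --mode=none
}

rescue_plan /root/vm-rescue.conf
```

In real use these would be run one at a time rather than as a single script, since steps 4 to 6 are interactive.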

At the time, we had a bug [1] to document this. The result is [2]. It does
not detail how to boot/reinstall os/etc., only restore (if e.g. db is dead
but fs is ok).
For something somewhat similar to what you want, see also [3], which uses
guestfish. Might be useful, depending on how badly your disk is corrupted.

How did you run into a split brain? There is a lock on the shared storage
that should prevent this.

Good luck and best regards,

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1482710
[2]
https://www.ovirt.org/documentation/administration_guide/#Overwriting_a_Self-Hosted_Engine
[3] https://bugzilla.redhat.com/show_bug.cgi?id=1569827#c4
-- 
Didi
___
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/2ALJN3CXYNC2UUCEI6H7HX3QU7YWUAML/