What you are missing is that Gluster requires more than one set of
bricks to recover from a dead host. I.e., in your setup you'd need 6
hosts: 4x replicas and 2x arbiters, with at least one set (2x replicas
and 1x arbiter) operational as a bare minimum.
Otherwise, automated commands to fix the volume do not exist. (It's a
Gluster limitation.) It can, however, be fixed manually.

Standard disclaimer: Back up your data first! Fixing this issue
requires manual intervention. The reader assumes all responsibility for
any action resulting from the instructions below. Etc.

If it's just a dead brick (i.e. the host is still functional), all you
really need to do is replace the underlying storage:

1. Take the gluster volume offline.
2. Remove the bad storage device, and attach the replacement.
3. rsync / scp / etc. the data from a known good brick (be sure to
include hidden files and preserve file times, ownership, SELinux
labels, etc.; see the sketch after this list).
4. Restart the gluster volume.
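
A minimal sketch of step 3, assuming the brick path from your volume
info below (/mnt/LogGFSData/brick) and that the copy runs as root so
the hidden .glusterfs directory and Gluster's trusted.* extended
attributes come across intact ("good-host" is a placeholder for
whichever peer still has a healthy brick):

# Run as root on the host with the replacement storage.
# -a preserves times/ownership/permissions, -A ACLs, -X xattrs
# (SELinux labels and Gluster's trusted.* attributes), -H hard links
# (used heavily inside .glusterfs).
rsync -aAXH --numeric-ids root@good-host:/mnt/LogGFSData/brick/ \
      /mnt/LogGFSData/brick/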

Gluster *might* still need to heal everything after all of that, but it
should start the volume and get it running again.
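
If it does need a heal, it can be kicked off and watched with the
standard heal commands (volume name taken from your output below):

$ gluster volume heal GV2Data full
$ gluster volume heal GV2Data info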

If the host itself is dead (and the underlying storage is still
functional), you can just move the underlying storage over to the new
host:

1. Take the gluster volume offline.
2. Attach the old storage.
3. Fix up the IDs on the volume file (see the sketch after this list).
(https://serverfault.com/questions/631365/rename-a-glusterfs-peer)
4. Restart the gluster volume.
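
The serverfault link above boils down to giving the replacement host
the dead peer's identity. A rough sketch, assuming the default glusterd
state directory (/var/lib/glusterd) and that you've first looked up the
dead host's UUID in /var/lib/glusterd/peers/ on a surviving node:

# On the replacement host:
systemctl stop glusterd
# Edit /var/lib/glusterd/glusterd.info and set UUID= to the dead
# host's UUID as recorded on the surviving peers.
systemctl start glusterd
# From a surviving node, confirm the peer now shows the old UUID:
gluster peer status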

If both the host and underlying storage are dead, you'll need to do
both tasks:

1. Take the gluster volume offline.
2. Attach the new storage.
3. rsync / scp / etc. the data from a known good brick (be sure to
include hidden files and preserve file times, ownership, SELinux
labels, etc., as above).
4. Fix up the IDs on the volume file (as above).
5. Restart the gluster volume.
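
One extra check worth doing whenever the data has been copied onto
fresh storage: the brick root carries the volume's ID as an extended
attribute, and the brick won't start if it is missing or wrong.
Assuming the brick path from your volume info below:

$ getfattr -n trusted.glusterfs.volume-id -e hex /mnt/LogGFSData/brick
# Should correspond to the Volume ID from 'gluster volume info'
# (c1946fc2-ed94-4b9f-9da3-f0f1ee90f303 in your output below).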

Keep one thing in mind, however: if the Gluster host you are replacing
is the one oVirt uses to connect to the volume (i.e. it's the host
named in the volume config in the Admin Portal), the new host will need
to retain the old hostname / IP, or you'll need to update oVirt's
config. Otherwise the VM hosts will wind up in Unassigned /
Non-functional status.

- Patrick Hibbs

On Sun, 2022-07-17 at 22:15 +0300, Gilboa Davara wrote:
> Hello all,
> 
> I'm attempting to replace a dead host in a replica 2 + arbiter
> gluster setup and replace it with a new host.
> I've already set up a new host (same hostname..localdomain) and got
> into the cluster.
> 
> $ gluster peer status
> Number of Peers: 2
> 
> Hostname: office-wx-hv3-lab-gfs
> Uuid: 4e13f796-b818-4e07-8523-d84eb0faa4f9
> State: Peer in Cluster (Connected)
> 
> Hostname: office-wx-hv1-lab-gfs.localdomain <------ This is a new
> host.
> Uuid: eee17c74-0d93-4f92-b81d-87f6b9c2204d
> State: Peer in Cluster (Connected)
> 
> $ gluster volume info GV2Data
>  Volume Name: GV2Data
> Type: Replicate
> Volume ID: c1946fc2-ed94-4b9f-9da3-f0f1ee90f303
> Status: Stopped
> Snapshot Count: 0
> Number of Bricks: 1 x (2 + 1) = 3
> Transport-type: tcp
> Bricks:
> Brick1: office-wx-hv1-lab-gfs:/mnt/LogGFSData/brick  <------ This is
> the dead host.
> Brick2: office-wx-hv2-lab-gfs:/mnt/LogGFSData/brick
> Brick3: office-wx-hv3-lab-gfs:/mnt/LogGFSData/brick (arbiter)
> ...
> 
> Looking at the docs, it seems that I need to remove the dead brick.
> 
> $ gluster volume remove-brick GV2Data office-wx-hv1-lab-
> gfs:/mnt/LogGFSData/brick start
> Running remove-brick with cluster.force-migration enabled can result
> in data corruption. It is safer to disable this option so that files
> that receive writes during migration are not migrated.
> Files that are not migrated can then be manually copied after the
> remove-brick commit operation.
> Do you want to continue with your current cluster.force-migration
> settings? (y/n) y
> volume remove-brick start: failed: Removing bricks from replicate
> configuration is not allowed without reducing replica count
> explicitly
> 
> So I guess I need to drop from replica 2 + arbiter to replica 1 +
> arbiter (?).
> 
> $ gluster volume remove-brick GV2Data replica 1 office-wx-hv1-lab-
> gfs:/mnt/LogGFSData/brick start
> Running remove-brick with cluster.force-migration enabled can result
> in data corruption. It is safer to disable this option so that files
> that receive writes during migration are not migrated.
> Files that are not migrated can then be manually copied after the
> remove-brick commit operation.
> Do you want to continue with your current cluster.force-migration
> settings? (y/n) y
> volume remove-brick start: failed: need 2(xN) bricks for reducing
> replica count of the volume from 3 to 1
> 
> ... What am I missing?
> 
> - Gilboa
