Strahil,

Looking at your suggestions, I think I need to provide a bit more info on my current setup.
1. I have 9 hosts in total.

2. I have 5 storage domains:
   - hosted_storage (Data Master)
   - vmstore1 (Data)
   - data1 (Data)
   - data2 (Data)
   - ISO (NFS) // had to create this one because oVirt 4.3.3.1 would not let me upload disk images to a data domain without an ISO domain (I think this is due to a bug)

3. Each volume is of the type "Distributed Replicate" and each one is composed of 9 bricks. I started with 3 bricks per volume due to the initial hyperconverged setup, then I expanded the cluster (and the gluster cluster) by 3 hosts at a time until I got to a total of 9 hosts.

   Disks, bricks, and sizes used per volume:

   /dev/sdb   engine     100GB
   /dev/sdb   vmstore1   2600GB
   /dev/sdc   data1      2600GB
   /dev/sdd   data2      2600GB
   /dev/sde   --------   400GB SSD, used for caching purposes

From the above layout a few questions came up:

1. Using the web UI, how can I create a 100GB brick and a 2600GB brick to replace the bad bricks for "engine" and "vmstore1" within the same block device (sdb)? And what about /dev/sde (the caching disk)? When I tried creating a new brick through the UI, I saw that I could use /dev/sde for caching, but only for 1 brick (i.e. vmstore1); so if I try to create another brick, how would I specify that the same /dev/sde device should be used for its caching as well? (A rough LVM sketch follows this list.)

2. If I want to remove a brick, it being a replica 3, I go to Storage > Volumes > select the volume > Bricks; once in there I can select the 3 servers that compose the replicated bricks and click "remove". This gives a pop-up window with the following info:

   Are you sure you want to remove the following Brick(s)?
   - vmm11:/gluster_bricks/vmstore1/vmstore1
   - vmm12.virt.iad3p:/gluster_bricks/vmstore1/vmstore1
   - 192.168.0.100:/gluster-bricks/vmstore1/vmstore1
   - Migrate Data from the bricks?

   If I proceed with this, it means I will have to do it for all 4 volumes, which is just not very efficient. And if that is the only way, then I am hesitant to put this into a real production environment, as there is no way I can take that kind of a hit for +500 VMs :) and I also won't have that much spare storage or extra volumes to play with in a real scenario.

3. After modifying /etc/vdsm/vdsm.id yesterday, following https://stijn.tintel.eu/blog/2013/03/02/ovirt-problem-duplicate-uuids, I was able to add the server back to the cluster using a new FQDN and a new IP, and I tested replacing one of the bricks. This is my mistake, as mentioned in #3 above: I used /dev/sdb entirely for 1 brick, because through the UI I could not split the block device between 2 bricks (one for engine and one for vmstore1). So in the "gluster vol info" you might see vmm102.mydomain.com, but in reality it is myhost1.mydomain.com.

4. I am also attaching gluster_peer_status.txt; in the last 2 entries of that file you will see an entry for vmm10.mydomain.com (old/bad entry) and one for vmm102.mydomain.com (new entry; same server vmm10, but renamed to vmm102). Please also find the gluster_vol_info.txt file.

5. I am ready to redeploy this environment if needed, but I am also ready to test any other suggestion. If I can get a good understanding of how to recover from this, I will be ready to move to production.

6. Wondering if you'd be willing to have a look at my setup through a shared screen?
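For question 1, in case it helps frame an answer: outside the UI, the only way I can think of is plain LVM, carving two LVs out of /dev/sdb and splitting the SSD into one dm-cache pool per brick (as far as I know, a dm-cache pool attaches to exactly one LV, which would explain why the UI only offered /dev/sde to a single brick). A minimal sketch; all VG/LV names and cache sizes below are made up, not from my actual setup:

    # Sketch only -- VG/LV names and sizes are illustrative, untested here.
    pvcreate /dev/sdb /dev/sde
    vgcreate gluster_vg_sdb /dev/sdb /dev/sde

    # Two bricks carved out of the same spinning disk:
    lvcreate -L 100G  -n lv_engine   gluster_vg_sdb /dev/sdb
    lvcreate -L 2600G -n lv_vmstore1 gluster_vg_sdb /dev/sdb

    # Split the 400GB SSD into one cache pool per brick
    # (dm-cache attaches one pool to exactly one LV):
    lvcreate --type cache-pool -L 100G -n cpool_engine   gluster_vg_sdb /dev/sde
    lvcreate --type cache-pool -L 250G -n cpool_vmstore1 gluster_vg_sdb /dev/sde
    lvconvert -y --type cache --cachepool gluster_vg_sdb/cpool_engine   gluster_vg_sdb/lv_engine
    lvconvert -y --type cache --cachepool gluster_vg_sdb/cpool_vmstore1 gluster_vg_sdb/lv_vmstore1

    # XFS with the inode size gluster recommends, then mount as bricks:
    mkfs.xfs -i size=512 /dev/gluster_vg_sdb/lv_engine
    mkfs.xfs -i size=512 /dev/gluster_vg_sdb/lv_vmstore1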
Thanks,
Adrian

On Mon, Jun 10, 2019 at 11:41 PM Strahil <hunter86...@yahoo.com> wrote:

> Hi Adrian,
>
> You have several options:
>
> A) If you have space on another gluster volume (or volumes) or on
> NFS-based storage, you can migrate all VMs live. Once you do that, the
> simple way will be to stop and remove the storage domain (from the UI) and
> the gluster volume that corresponds to the problematic brick. Once gone, you
> can remove the entry in oVirt for the old host and add the newly built
> one. Then you can recreate your volume and migrate the data back.
>
> B) If you don't have space, you have to use a riskier approach
> (usually it shouldn't be risky, but I had a bad experience in gluster v3):
>
> - The new server has the same IP and hostname:
> Use the command line and run 'gluster volume reset-brick VOLNAME
> HOSTNAME:BRICKPATH HOSTNAME:BRICKPATH commit'.
> Replace VOLNAME with your volume name.
> A more practical example would be:
> 'gluster volume reset-brick data ovirt3:/gluster_bricks/data/brick
> ovirt3:/gluster_bricks/data/brick commit'
>
> If it refuses, then you have to clean up '/gluster_bricks/data' (which
> should be empty).
> Also check that the new peer has been probed via 'gluster peer status'. Check
> that the firewall is allowing gluster communication (you can compare it to the
> firewalls on another gluster host).
>
> The automatic healing will kick in within 10 minutes (if it succeeds) and will
> stress the other 2 replicas, so pick your time properly.
> Note: I'm not recommending you use the 'force' option in the previous
> command ... for now :)
>
> - The new server has a different IP/hostname:
> Instead of 'reset-brick' you can use 'replace-brick':
> It should be like this:
> gluster volume replace-brick data old-server:/path/to/brick
> new-server:/new/path/to/brick commit force
>
> In both cases check the status via:
> gluster volume info VOLNAME
>
> If your cluster is in production, I really recommend the first option,
> as it is less risky and the chance of unplanned downtime will be minimal.
>
> The 'reset-brick' in your previous e-mail shows that one of the servers
> is not connected. Check peer status on all servers; if there are fewer peers
> than there should be, check for network and/or firewall issues.
> On the new node, check that glusterd is enabled and running.
>
> In order to debug, you should provide more info, like 'gluster volume
> info' and the peer status from each node.
>
> Best Regards,
> Strahil Nikolov
>
> On Jun 10, 2019 20:10, Adrian Quintero <adrianquint...@gmail.com> wrote:
> >
> > Can you let me know how to fix the gluster and the missing brick?
> > I tried removing it by going to "Storage > Volumes > vmstore > Bricks >
> > selected the brick".
> > However, it is showing an unknown status (which is expected, because
> > the server was completely wiped), so if I try to "remove", "replace brick",
> > or "reset brick" it won't work.
> > If I do remove brick: Incorrect bricks selected for removal in
> > Distributed Replicate volume. Either all the selected bricks should be from
> > the same sub volume or one brick each for every sub volume!
> > If I try "replace brick" I can't, because I don't have another server with
> > extra bricks/disks.
> > And if I try "reset brick": Error while executing action Start Gluster
> > Volume Reset Brick: Volume reset brick commit force failed: rc=-1 out=()
> > err=['Host myhost1_mydomain_com not connected']
> >
> > Are you suggesting to try and fix the gluster using the command line?
> >
> > Note that I can't "peer detach" the server, so if I force the removal
> > of the bricks, would I need to force a downgrade to replica 2 instead of 3?
> > What would happen to oVirt, as it only supports replica 3?
> >
> > Thanks again.
> >
> > On Mon, Jun 10, 2019 at 12:52 PM Strahil <hunter86...@yahoo.com> wrote:
> >>
> >> Hi Adrian,
> >> Did you fix the issue with the gluster and the missing brick?
> >> If yes, try to set the 'old' host in maintenance an

--
Adrian Quintero
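For concreteness, applying the 'replace-brick' variant from Strahil's mail to this setup would presumably look like the following, repeated per volume. The old brick path (trailing dot included) is copied verbatim from the attached gluster_vol_info.txt; this sequence is untested here:

    # Sketch, per volume (data1 shown); old hostname copied verbatim from
    # 'gluster volume info', including its trailing dot. Untested.
    gluster volume replace-brick data1 \
        vmm10.mydomain.com.:/gluster_bricks/data1/data1 \
        vmm102.mydomain.com:/gluster_bricks/data1/data1 \
        commit force

    # Then verify and watch the self-heal:
    gluster volume info data1
    gluster volume heal data1 info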
Volume Name: data1
Type: Distributed-Replicate
Volume ID: a953be2a-a23f-4425-bf61-1a27fa029975
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: vmm10.mydomain.com.:/gluster_bricks/data1/data1
Brick2: vmm11.mydomain.com:/gluster_bricks/data1/data1
Brick3: vmm12.mydomain.com:/gluster_bricks/data1/data1
Brick4: vmm13.mydomain.com:/gluster_bricks/data1/data1
Brick5: vmm14.mydomain.com:/gluster_bricks/data1/data1
Brick6: vmm15.mydomain.com:/gluster_bricks/data1/data1
Brick7: vmm16.mydomain.com:/gluster_bricks/data1/data1
Brick8: vmm17.mydomain.com:/gluster_bricks/data1/data1
Brick9: vmm18.mydomain.com:/gluster_bricks/data1/data1
Options Reconfigured:
cluster.granular-entry-heal: enable
storage.owner-gid: 36
storage.owner-uid: 36
network.ping-timeout: 30
cluster.choose-local: off
user.cifs: off
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: off
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
performance.strict-o-direct: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

Volume Name: data2
Type: Distributed-Replicate
Volume ID: b5254bbb-a6a1-4f79-9513-d01f24331d03
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: vmm10.mydomain.com.:/gluster_bricks/data2/data2
Brick2: vmm11.mydomain.com:/gluster_bricks/data2/data2
Brick3: vmm12.mydomain.com:/gluster_bricks/data2/data2
Brick4: vmm13.mydomain.com:/gluster_bricks/data2/data2
Brick5: vmm14.mydomain.com:/gluster_bricks/data2/data2
Brick6: vmm15.mydomain.com:/gluster_bricks/data2/data2
Brick7: vmm16.mydomain.com:/gluster_bricks/data2/data2
Brick8: vmm17.mydomain.com:/gluster_bricks/data2/data2
Brick9: vmm18.mydomain.com:/gluster_bricks/data2/data2
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
performance.strict-o-direct: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
cluster.choose-local: off
network.ping-timeout: 30
storage.owner-uid: 36
storage.owner-gid: 36
cluster.granular-entry-heal: enable

Volume Name: engine
Type: Distributed-Replicate
Volume ID: e89321ed-bf10-4d24-a376-f86656b3d65c
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: vmm10.mydomain.com.:/gluster_bricks/engine/engine
Brick2: vmm11.mydomain.com:/gluster_bricks/engine/engine
Brick3: vmm12.mydomain.com:/gluster_bricks/engine/engine
Brick4: vmm13.mydomain.com:/gluster_bricks/engine/engine
Brick5: vmm14.mydomain.com:/gluster_bricks/engine/engine
Brick6: vmm15.mydomain.com:/gluster_bricks/engine/engine
Brick7: vmm16.mydomain.com:/gluster_bricks/engine/engine
Brick8: vmm17.mydomain.com:/gluster_bricks/engine/engine
Brick9: vmm18.mydomain.com:/gluster_bricks/engine/engine
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
performance.strict-o-direct: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
cluster.choose-local: off
network.ping-timeout: 30
storage.owner-uid: 36
storage.owner-gid: 36
cluster.granular-entry-heal: enable

Volume Name: vmstore1
Type: Distributed-Replicate
Volume ID: 19c4d170-3b79-44c4-8dbd-20dc49beb8b2
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: 192.168.0.100:/gluster-bricks/vmstore1/vmstore1
Brick2: vmm11.mydomain.com:/gluster_bricks/vmstore1/vmstore1
Brick3: vmm12.mydomain.com:/gluster_bricks/vmstore1/vmstore1
Brick4: vmm13.mydomain.com:/gluster_bricks/vmstore1/vmstore1
Brick5: vmm14.mydomain.com:/gluster_bricks/vmstore1/vmstore1
Brick6: vmm15.mydomain.com:/gluster_bricks/vmstore1/vmstore1
Brick7: vmm16.mydomain.com:/gluster_bricks/vmstore1/vmstore1
Brick8: vmm17.mydomain.com:/gluster_bricks/vmstore1/vmstore1
Brick9: vmm18.mydomain.com:/gluster_bricks/vmstore1/vmstore1
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
performance.strict-o-direct: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
cluster.choose-local: off
network.ping-timeout: 30
storage.owner-uid: 36
storage.owner-gid: 36
cluster.granular-entry-heal: enable
Number of Peers: 9

Hostname: vmm12.mydomain.com
Uuid: 2c86fa95-67a2-492d-abf0-54da625417f8
State: Peer in Cluster (Connected)
Other names:
192.168.0.4
172.26.0.26

Hostname: vmm13.mydomain.com
Uuid: ab099e72-0f56-4d33-a16b-ba67d67bdf9d
State: Peer in Cluster (Connected)
Other names:
172.26.0.27

Hostname: vmm14.mydomain.com
Uuid: c35ad74d-1f83-4032-a459-079a27175ee4
State: Peer in Cluster (Connected)
Other names:
172.26.0.28

Hostname: vmm17.mydomain.com
Uuid: aeb7712a-e74e-4492-b6af-9c266d69bfd3
State: Peer in Cluster (Connected)
Other names:
192.168.0.9
172.26.0.32

Hostname: vmm16.mydomain.com
Uuid: 4476d434-d6ff-480f-b3f1-d976f642df9c
State: Peer in Cluster (Connected)
Other names:
192.168.0.8
172.26.0.31

Hostname: vmm15.mydomain.com
Uuid: 22ec0c0a-a5fc-431c-9f32-8b17fcd80298
State: Peer in Cluster (Connected)
Other names:
172.26.0.29

Hostname: vmm18.mydomain.com
Uuid: caf84e9f-3e03-4e6f-b0f8-4c5ecec4bef6
State: Peer in Cluster (Connected)
Other names:
192.168.0.10
172.26.0.33

Hostname: vmm10.mydomain.com
Uuid: 18385970-aba6-4fd1-85a6-1b13f663e60b
State: Peer in Cluster (Disconnected)
Other names:
192.168.0.2
192.168.0.21
172.26.0.4

Hostname: vmm102.mydomain.com
Uuid: b152fd82-8213-451f-93c6-353e96aa3be9
State: Peer in Cluster (Connected)
Other names:
192.168.0.100
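One note on the stale "Disconnected" peer above: once no volume's brick list references vmm10.mydomain.com any more (i.e. after every bad brick has been replaced), detaching it should become possible. A sketch, untested here ('force' exists as a last resort, but is risky on a live cluster):

    # Only after all volumes have been re-pointed away from vmm10's bricks:
    gluster peer detach vmm10.mydomain.com
    gluster peer status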