Do you have empty space to store the VMs? If yes, you can always script the
migration of the disks via the API. Even a bash script and curl can do the
trick.
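
Something along these lines should work against the v4 REST API - just a rough
sketch, and the engine FQDN, credentials and UUIDs below are placeholders you
would have to fill in (and loop over your disk list):

#!/bin/bash
# Move one disk to another storage domain via the oVirt REST API.
# All values are placeholders - adjust them for your environment.
ENGINE="https://engine.example.com/ovirt-engine/api"
AUTH="admin@internal:mypassword"
DISK_ID="11111111-1111-1111-1111-111111111111"     # disk to migrate
TARGET_SD="22222222-2222-2222-2222-222222222222"   # target storage domain

curl -k -u "$AUTH" \
     -H "Content-Type: application/xml" \
     -X POST \
     -d "<action><storage_domain id=\"$TARGET_SD\"/></action>" \
     "$ENGINE/disks/$DISK_ID/move"
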
About the /dev/sdb, I still don't get it. A plain "df -hT" from a node will
make it much clearer. I guess /dev/sdb is a PV and you have 2 LVs on top of it.
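For example, the output of these standard commands would show the whole
picture (nothing here is specific to your setup):

df -hT | grep -i gluster     # brick mount points and filesystems
lsblk /dev/sdb               # how the LVs stack on that disk
pvs
lvs -a -o +devices           # which PVs each LV sits on
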
Note: I should admit that as an admin - I don't use the UI for gluster
management.
For now do not try to remove the brick. The approach is either to migrate the
qemu disks to another storage domain, or to reset-brick/replace-brick in order
to restore the replica count. I will check the file and try to figure it out.
Redeployment never fixes the issue, it just speeds up the recovery. If you can
afford the time to spend on fixing the issue - then do not redeploy.
I would be able to take a look next week, but keep in mind that I'm not so
deep into oVirt - I only started playing with it when I deployed my lab.

Best Regards,
Strahil Nikolov
Strahil,

Looking at your suggestions I think I need to provide a bit more info on my
current setup.

- I have 9 hosts in total.
- I have 5 storage domains:
  - hosted_storage (Data Master)
  - vmstore1 (Data)
  - data1 (Data)
  - data2 (Data)
  - ISO (NFS) // had to create this one because oVirt 4.3.3.1 would not let me
    upload disk images to a data domain without an ISO (I think this is due to
    a bug)
- Each volume is of the type "Distributed Replicate" and each one is composed
  of 9 bricks. I started with 3 bricks per volume due to the initial
  Hyperconverged setup, then I expanded the cluster and the gluster cluster by
  3 hosts at a time until I got to a total of 9 hosts.
- Disks, bricks and sizes used per volume:
  /dev/sdb   engine     100GB
  /dev/sdb   vmstore1   2600GB
  /dev/sdc   data1      2600GB
  /dev/sdd   data2      2600GB
  /dev/sde   --------   400GB SSD, used for caching purposes

From the above layout a few questions came up:
  - Using the web UI, how can I create a 100GB brick and a 2600GB brick to
    replace the bad bricks for "engine" and "vmstore1" within the same block
    device (sdb)?
  - What about /dev/sde (the caching disk)? When I tried creating a new brick
    through the UI I saw that I could use /dev/sde for caching, but only for 1
    brick (i.e. vmstore1), so if I create another brick how would I specify
    that the same /dev/sde device should be used for its caching? (A rough
    command-line sketch of what I mean is right below.)
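
  To make the question concrete, this is roughly the layout I am after if I
  were doing it by hand with LVM - the VG/LV names and sizes are just my own
  illustration, not what the oVirt deployment flow actually generates:

  # Two bricks carved out of the same block device:
  pvcreate /dev/sdb
  vgcreate gluster_vg_sdb /dev/sdb
  lvcreate -L 100G  -n gluster_lv_engine   gluster_vg_sdb
  lvcreate -L 2600G -n gluster_lv_vmstore1 gluster_vg_sdb
  mkfs.xfs -i size=512 /dev/gluster_vg_sdb/gluster_lv_engine
  mkfs.xfs -i size=512 /dev/gluster_vg_sdb/gluster_lv_vmstore1

  # SSD caching with lvmcache: one cache pool can only serve one origin LV,
  # which seems to match what the UI allows (e.g. caching only vmstore1).
  # To cache both bricks, /dev/sde would have to be split into two cache pools.
  pvcreate /dev/sde
  vgextend gluster_vg_sdb /dev/sde
  lvcreate --type cache-pool -L 350G -n lv_cache gluster_vg_sdb /dev/sde
  lvconvert --type cache --cachepool gluster_vg_sdb/lv_cache \
      gluster_vg_sdb/gluster_lv_vmstore1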
 
 



   
- If I want to remove a brick, the volume being a replica 3, I go to Storage >
  Volumes > select the volume > Bricks; once in there I can select the 3
  servers that compose the replicated brick set and click Remove. This gives a
  pop-up window with the following info:

  Are you sure you want to remove the following Brick(s)?
  - vmm11:/gluster_bricks/vmstore1/vmstore1
  - vmm12.virt.iad3p:/gluster_bricks/vmstore1/vmstore1
  - 192.168.0.100:/gluster-bricks/vmstore1/vmstore1
  - Migrate Data from the bricks?

  If I proceed with this, it means I will have to do it for all 4 volumes,
  which is just not very efficient. But if that is the only way, then I am
  hesitant to put this into a real production environment, as there is no way
  I can take that kind of a hit for 500+ VMs :) and I also won't have that
  much spare storage or extra volumes to play with in a real scenario.
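
  For comparison, the CLI equivalent of that pop-up would, as far as I
  understand it, be something like this (same three bricks as in the example
  above; a distributed-replicate volume requires removing one full replica
  set at a time):

  BRICKS="vmm11:/gluster_bricks/vmstore1/vmstore1 vmm12.virt.iad3p:/gluster_bricks/vmstore1/vmstore1 192.168.0.100:/gluster-bricks/vmstore1/vmstore1"
  gluster volume remove-brick vmstore1 $BRICKS start    # migrates data off the set
  gluster volume remove-brick vmstore1 $BRICKS status   # wait until it completes
  gluster volume remove-brick vmstore1 $BRICKS commit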
   
 
 
- After modifying /etc/vdsm/vdsm.id yesterday by following
  (https://stijn.tintel.eu/blog/2013/03/02/ovirt-problem-duplicate-uuids) I
  was able to add the server back to the cluster using a new FQDN and a new
  IP, and I tested replacing one of the bricks. This is my mistake, as
  mentioned in #3 above: I used /dev/sdb entirely for 1 brick, because through
  the UI I could not split the block device so it could be used for 2 bricks
  (one for the engine and one for vmstore1). So in the "gluster vol info"
  output you might see vmm102.mydomain.com, but in reality it is
  myhost1.mydomain.com.
   
 
 
- I am also attaching gluster_peer_status.txt; in the last 2 entries of that
  file you will see an entry for vmm10.mydomain.com (the old/bad entry) and
  vmm102.mydomain.com (the new entry, same server as vmm10, but renamed to
  vmm102). Please also find the gluster_vol_info.txt file.
   
 
 
- I am ready to redeploy this environment if needed, but I am also ready to
  test any other suggestion. If I can get a good understanding of how to
  recover from this, I will be ready to move to production.
   
 
 
- Wondering if you'd be willing to have a look at my setup through a shared
  screen?

Thanks,

Adrian

On Mon, Jun 10, 2019 at 11:41 PM Strahil <hunter86...@yahoo.com> wrote:


Hi Adrian,

You have several options:
A) If you have space on another gluster volume (or volumes) or on NFS-based
storage, you can migrate all VMs live. Once you do that, the simple way will
be to stop and remove the storage domain (from the UI) and the gluster volume
that corresponds to the problematic brick. Once they are gone, you can remove
the entry in oVirt for the old host and add the newly built one. Then you can
recreate your volume and migrate the data back.
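
On the gluster side, that part would look roughly like this - volume and host
names are only examples, and your real volumes have 9 bricks (3 x replica 3),
so list all of them when recreating:

gluster volume stop vmstore1
gluster volume delete vmstore1
# once the rebuilt host is back in the cluster with fresh, empty bricks:
gluster volume create vmstore1 replica 3 \
    host1:/gluster_bricks/vmstore1/vmstore1 \
    host2:/gluster_bricks/vmstore1/vmstore1 \
    host3:/gluster_bricks/vmstore1/vmstore1
gluster volume start vmstore1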

B) If you don't have space, you have to use a riskier approach (usually it
shouldn't be risky, but I had a bad experience in gluster v3):
- The new server has the same IP and hostname:
Use the command line and run 'gluster volume reset-brick VOLNAME
HOSTNAME:BRICKPATH HOSTNAME:BRICKPATH commit'.
Replace VOLNAME with your volume name.
A more practical example would be:
'gluster volume reset-brick data ovirt3:/gluster_bricks/data/brick
ovirt3:/gluster_bricks/data/brick commit'

If it refuses, then you have to clean up '/gluster_bricks/data' (which should
be empty).
Also check if the new peer has been probed via 'gluster peer status'. Check
that the firewall is allowing gluster communication (you can compare it to the
firewalls on another gluster host).
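
For example (the brick path and volume are just the ones from the example
above):

ls -la /gluster_bricks/data      # the old brick directory must exist and be empty
systemctl status glusterd        # glusterd has to be enabled and running
gluster peer status              # the rebuilt node must show as connected
firewall-cmd --list-services     # compare with a healthy gluster node
firewall-cmd --list-ports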


The automatic healing will kick in within 10 minutes (if the reset-brick
succeeds) and will stress the other 2 replicas, so pick your time properly.
Note: I'm not recommending that you use the 'force' option in the previous
command ... for now :)
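
You can watch the heal with something like this (the volume name is just an
example):

gluster volume heal data info                     # files still pending heal, per brick
gluster volume heal data statistics heal-count    # quick per-brick count
gluster volume heal data full                     # trigger a full heal manually if needed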

- The new server has a different IP/hostname:
Instead of 'reset-brick' you can use 'replace-brick'. It should be like this:
gluster volume replace-brick data old-server:/path/to/brick 
new-server:/new/path/to/brick commit force

In both cases check the status via:
gluster volume info VOLNAME

If your cluster is in production, I really recommend the first option, as it
is less risky and the chance of unplanned downtime will be minimal.

The 'reset-brick' error in your previous e-mail shows that one of the servers
is not connected. Check the peer status on all servers; if fewer peers show up
than there should be, check for network and/or firewall issues.
On the new node, check that glusterd is enabled and running.

In order to debug - you should provide more info like 'gluster volume info' and 
the peer status from each node.

Best Regards,
Strahil Nikolov

On Jun 10, 2019 20:10, Adrian Quintero <adrianquint...@gmail.com> wrote:



>
> Can you let me know how to fix the gluster and the missing brick?
> I tried removing it by going to Storage > Volumes > vmstore > Bricks and
> selecting the brick.
> However it is showing as an unknown status (which is expected because the
> server was completely wiped), so if I try to "remove", "replace brick" or
> "reset brick" it won't work.
> If i do remove brick: Incorrect bricks selected for removal in Distributed 
> Replicate volume. Either all the selected bricks should be from the same sub 
> volume or one brick each for every sub volume!
> If I try "replace brick" I can't, because I don't have another server with
> extra bricks/disks.
> And if I try "reset brick": Error while executing action Start Gluster Volume 
> Reset Brick: Volume reset brick commit force failed: rc=-1 out=() err=['Host 
> myhost1_mydomain_com  not connected']
>
> Are you suggesting to try and fix the gluster using the command line?
>
> Note that I can't "peer detach" the server, so if I force the removal of the
> bricks, would I need to force a downgrade to replica 2 instead of 3? What
> would happen to oVirt, as it only supports replica 3?
>
> thanks again.
>
> On Mon, Jun 10, 2019 at 12:52 PM Strahil <hunter86...@yahoo.com> wrote:





>>
>> Hi Adrian,
>> Did you fix the issue with the gluster and the missing brick?
>> If yes, try to set the 'old' host in maintenance an





-- 
Adrian Quintero
  
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/K65ROMQARXI5ZHX524D4XZBAFSFKBWFU/
