adding gluster pool list:

UUID                                  Hostname             State
2c86fa95-67a2-492d-abf0-54da625417f8  vmm12.mydomain.com   Connected
ab099e72-0f56-4d33-a16b-ba67d67bdf9d  vmm13.mydomain.com   Connected
c35ad74d-1f83-4032-a459-079a27175ee4  vmm14.mydomain.com   Connected
aeb7712a-e74e-4492-b6af-9c266d69bfd3  vmm17.mydomain.com   Connected
4476d434-d6ff-480f-b3f1-d976f642df9c  vmm16.mydomain.com   Connected
22ec0c0a-a5fc-431c-9f32-8b17fcd80298  vmm15.mydomain.com   Connected
caf84e9f-3e03-4e6f-b0f8-4c5ecec4bef6  vmm18.mydomain.com   Connected
18385970-aba6-4fd1-85a6-1b13f663e60b  vmm10.mydomain.com   Disconnected  // server that went bad
b152fd82-8213-451f-93c6-353e96aa3be9  vmm102.mydomain.com  Connected     // vmm10 but with a different name
228a9282-c04e-4229-96a6-67cb47629892  localhost            Connected
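For the archive: the Disconnected peer in the pool list above can be cross-checked from any connected node before any brick surgery. A minimal dry-run sketch, assuming the hostname from the list above; the gluster commands are only echoed, not executed, so nothing touches the cluster until you paste them yourself:

```shell
# Sketch: confirm the stale (Disconnected) peer before replacing bricks.
# Hostname is taken from the pool list above; commands are echoed (dry run).
BAD_PEER="vmm10.mydomain.com"

# Check how the other nodes see the dead peer, then (only after all of its
# bricks are replaced or removed from every volume) detach it:
echo "gluster peer status | grep -A 2 ${BAD_PEER}"
echo "gluster peer detach ${BAD_PEER}"
```

Note that `peer detach` refuses while the dead host still owns bricks, which matches the behavior Adrian reports later in this thread.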
On Tue, Jun 11, 2019 at 11:24 AM Adrian Quintero <[email protected]> wrote:

> Strahil,
>
> Looking at your suggestions, I think I need to provide a bit more info on
> my current setup.
>
> 1. I have 9 hosts in total.
>
> 2. I have 5 storage domains:
>    - hosted_storage (Data Master)
>    - vmstore1 (Data)
>    - data1 (Data)
>    - data2 (Data)
>    - ISO (NFS) // had to create this one because oVirt 4.3.3.1 would not
>      let me upload disk images to a data domain without an ISO domain
>      (I think this is due to a bug)
>
> 3. Each volume is of the type "Distributed Replicate" and each one is
>    composed of 9 bricks. I started with 3 bricks per volume due to the
>    initial hyperconverged setup, then I expanded the cluster and the
>    gluster cluster by 3 hosts at a time until I got to a total of 9
>    hosts.
>
>    Disks, bricks and sizes used per volume:
>    /dev/sdb  engine    100GB
>    /dev/sdb  vmstore1  2600GB
>    /dev/sdc  data1     2600GB
>    /dev/sdd  data2     2600GB
>    /dev/sde  --------  400GB SSD, used for caching purposes
>
> From the above layout a few questions came up:
>
> 1. Using the web UI, how can I create a 100GB brick and a 2600GB brick
>    to replace the bad bricks for "engine" and "vmstore1" within the same
>    block device (sdb)? And what about /dev/sde (the caching disk)? When
>    I tried creating a new brick through the UI I saw that I could use
>    /dev/sde for caching, but only for 1 brick (i.e. vmstore1), so if I
>    create another brick, how would I specify that the same /dev/sde
>    device is to be used for its caching?
>
> 2. If I want to remove a brick, it being a replica 3, I go to Storage >
>    Volumes > select the volume > Bricks. Once in there I can select the
>    3 servers that compose the replicated bricks and click Remove. This
>    gives a pop-up window with the following info:
>
>    Are you sure you want to remove the following Brick(s)?
>    - vmm11:/gluster_bricks/vmstore1/vmstore1
>    - vmm12.virt.iad3p:/gluster_bricks/vmstore1/vmstore1
>    - 192.168.0.100:/gluster-bricks/vmstore1/vmstore1
>    - Migrate Data from the bricks?
>
>    If I proceed with this, it means I will have to do it for all 4
>    volumes. That is just not very efficient, but if it is the only way,
>    then I am hesitant to put this into a real production environment, as
>    there is no way I can take that kind of a hit for 500+ VMs :) and I
>    also won't have that much storage or extra volumes to play with in a
>    real scenario.
>
> 3. After modifying /etc/vdsm/vdsm.id yesterday by following
>    https://stijn.tintel.eu/blog/2013/03/02/ovirt-problem-duplicate-uuids,
>    I was able to add the server back to the cluster using a new FQDN and
>    a new IP, and tested replacing one of the bricks. This is my mistake:
>    as mentioned in point 3 of my setup above, I used /dev/sdb entirely
>    for 1 brick, because through the UI I could not split the block
>    device into 2 bricks (one for the engine and one for vmstore1). So in
>    "gluster vol info" you might see vmm102.mydomain.com, but in reality
>    it is myhost1.mydomain.com.
>
> 4. I am also attaching gluster_peer_status.txt; in the last 2 entries of
>    that file you will see an entry for vmm10.mydomain.com (old/bad
>    entry) and vmm102.mydomain.com (new entry, same server vmm10, but
>    renamed to vmm102). Also please find the gluster_vol_info.txt file.
>
> 5. I am ready to redeploy this environment if needed, but I am also
>    ready to test any other suggestion. If I can get a good understanding
>    of how to recover from this, I will be ready to move to production.
>
> 6. Wondering if you'd be willing to have a look at my setup through a
>    shared screen?
>
> Thanks,
>
> Adrian
>
> On Mon, Jun 10, 2019 at 11:41 PM Strahil <[email protected]> wrote:
>
>> Hi Adrian,
>>
>> You have several options:
>>
>> A) If you have space on another gluster volume (or volumes) or on
>> NFS-based storage, you can migrate all VMs live. Once you do, the
>> simple way will be to stop and remove the storage domain (from the UI)
>> and the gluster volume that corresponds to the problematic brick. Once
>> gone, you can remove the entry in oVirt for the old host and add the
>> newly built one. Then you can recreate your volume and migrate the
>> data back.
>>
>> B) If you don't have space, you have to use a riskier approach
>> (usually it shouldn't be risky, but I had a bad experience in gluster
>> v3):
>>
>> - The new server has the same IP and hostname:
>> Use the command line and run 'gluster volume reset-brick VOLNAME
>> HOSTNAME:BRICKPATH HOSTNAME:BRICKPATH commit'.
>> Replace VOLNAME with your volume name.
>> A more practical example would be:
>> 'gluster volume reset-brick data ovirt3:/gluster_bricks/data/brick
>> ovirt3:/gluster_bricks/data/brick commit'
>>
>> If it refuses, then you have to clean up '/gluster_bricks/data' (which
>> should be empty).
>> Also check whether the new peer has been probed via 'gluster peer
>> status'. Check that the firewall is allowing gluster communication
>> (you can compare it to the firewalls on another gluster host).
>>
>> The automatic healing will kick in within 10 minutes (if it succeeds)
>> and will stress the other 2 replicas, so pick your time properly.
>> Note: I'm not recommending you use the 'force' option in the previous
>> command ... for now :)
>>
>> - The new server has a different IP/hostname:
>> Instead of 'reset-brick' you can use 'replace-brick'.
>> It should be like this:
>> gluster volume replace-brick data old-server:/path/to/brick
>> new-server:/new/path/to/brick commit force
>>
>> In both cases check the status via:
>> gluster volume info VOLNAME
>>
>> If your cluster is in production, I really recommend the first option,
>> as it is less risky and the chance of unplanned downtime will be
>> minimal.
>>
>> The 'reset-brick' in your previous e-mail shows that one of the
>> servers is not connected. Check the peer status on all servers; if
>> there are fewer peers than there should be, check for network and/or
>> firewall issues.
>> On the new node, check whether glusterd is enabled and running.
>>
>> In order to debug, you should provide more info like 'gluster volume
>> info' and the peer status from each node.
>>
>> Best Regards,
>> Strahil Nikolov
>>
>> On Jun 10, 2019 20:10, Adrian Quintero <[email protected]> wrote:
>>
>> >
>> > Can you let me know how to fix the gluster and missing brick?
>> > I tried removing it by going to Storage > Volumes > vmstore > Bricks
>> and selecting the brick.
>> > However it is showing an unknown status (which is expected because
>> the server was completely wiped), so if I try to "remove", "replace
>> brick" or "reset brick" it won't work.
>> > If I do remove brick: Incorrect bricks selected for removal in
>> Distributed Replicate volume. Either all the selected bricks should be
>> from the same sub volume or one brick each for every sub volume!
>> > If I try "replace brick" I can't, because I don't have another
>> server with extra bricks/disks.
>> > And if I try "reset brick": Error while executing action Start
>> Gluster Volume Reset Brick: Volume reset brick commit force failed:
>> rc=-1 out=() err=['Host myhost1_mydomain_com not connected']
>> >
>> > Are you suggesting to try and fix the gluster using the command
>> line?
>> >
>> > Note that I can't "peer detach" the server, so if I force the
>> removal of the bricks, would I need to force a downgrade to replica 2
>> instead of 3? What would happen to oVirt, as it only supports replica
>> 3?
>> >
>> > Thanks again.
>> >
>> > On Mon, Jun 10, 2019 at 12:52 PM Strahil <[email protected]> wrote:
>> >>
>> >> Hi Adrian,
>> >> Did you fix the issue with the gluster and the missing brick?
>> >> If yes, try to set the 'old' host in maintenance an
>
> --
> Adrian Quintero

--
Adrian Quintero
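For the archive: with the rebuilt host under a new name (vmm10 rebuilt as vmm102, as described above), Strahil's option B boils down to one replace-brick per volume. A dry-run sketch, assuming the volume names and the /gluster_bricks path layout from this thread; it only prints the commands, since a real replace-brick triggers the healing that stresses the two remaining replicas:

```shell
# Sketch of option B (different hostname): swap the dead host's brick in
# each volume for the brick on the rebuilt host, then watch healing.
# Hostnames and volume names come from this thread; commands are echoed,
# not executed. Adjust brick paths to your actual layout.
OLD_HOST="vmm10.mydomain.com"
NEW_HOST="vmm102.mydomain.com"

for VOL in engine vmstore1 data1 data2; do
  # replace-brick requires 'commit force'; if the rebuilt host had kept
  # its old hostname/IP, 'reset-brick ... commit' would be used instead.
  echo "gluster volume replace-brick ${VOL} \
${OLD_HOST}:/gluster_bricks/${VOL}/${VOL} \
${NEW_HOST}:/gluster_bricks/${VOL}/${VOL} commit force"
  echo "gluster volume heal ${VOL} info"
done
```

As Strahil notes, this keeps the volumes at replica 3 throughout, so oVirt's replica-3 requirement is never violated; only one subvolume is degraded at a time while healing runs.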
_______________________________________________
Users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/[email protected]/message/GWTF2PJ7FHPIKIFLRXCR35AC7HMCSTTJ/

