Tim,
Try deleting all the pods (the glusterfs pods and the heketi pod), one
at a time. I’ve had this work for me: the pods came back up and heketi
was ok.
You can also try restarting glusterd from a terminal in each glusterfs
pod. That’s worked for me to get out of heketi db issues.
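Roughly, assuming the glusterfs pods are in a "glusterfs" project
(adjust the namespace and pod names to your setup):

oc get pods -n glusterfs
oc delete pod <glusterfs-or-heketi-pod> -n glusterfs   # one at a time; wait for the replacement to go Ready
oc rsh -n glusterfs <glusterfs-pod>
systemctl restart glusterd   # from the pod's terminal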
Beyond that I don’t have any other ideas. I’ve not found good
information on how to troubleshoot or resolve issues like this.
Thanks,
Todd
On 8/24/18, 4:37 AM, "Tim Dudgeon" <tdudgeon...@gmail.com> wrote:
Todd,
Thanks for that. It seems along the lines of what I need.
The problem, though, is that I have an additional issue: the heketi
pod is not starting because of a messed-up database configuration.
These two problems happened independently, but in the same OpenShift
environment.
This means I'm unable to run heketi-cli until that is fixed.
I'm not sure whether I can modify the heketi database configuration
as described in the troubleshooting guide [1] so that it only knows
about the two good gluster nodes, and then add the third one back.
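From [1] it looks like the edit would be something like this (a
sketch, assuming this heketi build has the db export/import
subcommands, run against a copy of the db file while the server is
down):

heketi db export --dbfile /var/lib/heketi/heketi.db --jsonfile /tmp/heketi-db.json
# edit the JSON to drop the dead node, its devices and bricks
heketi db import --jsonfile /tmp/heketi-db.json --dbfile /var/lib/heketi/heketi.db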
Any thoughts?
Tim
[1]
https://github.com/heketi/heketi/blob/master/docs/troubleshooting.md
On 23/08/18 17:14, Walters, Todd wrote:
> Tim,
>
> I have had this issue with a 3-node cluster. I created a new node
> with new devices, ran scaleup and the gluster playbook with some
> changes, then ran heketi-cli commands to add the new node and remove
> the old one.
>
> For your other question, I’ve restarted all the glusterfs pods and
> the heketi pod and resolved that issue before. I guess you can
> restart glusterd in each pod too?
>
> Here’s a doc I wrote on node replacement. I’m not sure if this is
> the proper procedure, but it works, and I wasn’t able to find any
> decent solution in the docs.
>
> # ----- Replacing a Failed Node ---- #
>
> Disable the node to simulate failure
> Get the node id with heketi-cli node list or topology info
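> (For example, assuming the same server and auth flags as below:)
>
> heketi-cli node list -s http://localhost:8080 --user admin --secret "$HEKETI_CLI_KEY"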
>
> heketi-cli node disable fb344a2ea889c7e25a772e747eeeec2a -s http://localhost:8080 --user admin --secret "$HEKETI_CLI_KEY"
> Node fb344a2ea889c7e25a772e747eeeec2a is now offline
>
> Stop Node in AWS Console
> Scale up another node (4) for Gluster via Terraform
> Run scaleup_node.yml playbook
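> (The exact invocation depends on your inventory layout; ours is
> something like:)
>
> ansible-playbook -i inventory/hosts scaleup_node.yml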
>
> Add New Node and Device
>
> heketi-cli node add --zone=1 --cluster=441248c1b2f032a93aca4a4e03648b28 --management-host-name=ip-new-node.ec2.internal --storage-host-name=newnodeIP -s http://localhost:8080 --user admin --secret "$HEKETI_CLI_KEY"
> heketi-cli device add --name /dev/xvdc --node 8973b41d8a4e437bd8b36d7df1a93f06 -s http://localhost:8080 --user admin --secret "$HEKETI_CLI_KEY"
>
>
> Run the deploy_gluster playbook, with the following changes in OSEv3 (see the inventory snippet below):
>
> - openshift_storage_glusterfs_wipe: False
> - openshift_storage_glusterfs_is_missing: False
> - openshift_storage_glusterfs_heketi_is_missing: False
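> (In the Ansible inventory these go under [OSEv3:vars], e.g.:)
>
> [OSEv3:vars]
> openshift_storage_glusterfs_wipe=False
> openshift_storage_glusterfs_is_missing=False
> openshift_storage_glusterfs_heketi_is_missing=False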
>
> Verify topology
> rsh into the heketi pod
> run heketi-exports (a file I created with export commands)
> get the old and new node info (ids)
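> (For example, assuming the heketi deploymentconfig is named
> heketi-storage; heketi-exports just sets the env vars the CLI needs:)
>
> oc rsh dc/heketi-storage
> sh-4.4# . ./heketi-exports
> sh-4.4# heketi-cli topology info -s http://localhost:8080 --user admin --secret "$HEKETI_CLI_KEY"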
>
> Remove Node
>
> sh-4.4# heketi-cli node remove fb344a2ea889c7e25a772e747eeeec2a -s http://localhost:8080 --user admin --secret "$HEKETI_CLI_KEY"
> Node fb344a2ea889c7e25a772e747eeeec2a is now removed
>
>
> Remove All Devices (check the topology)
>
> sh-4.4# heketi-cli device delete ea85942eaec73cb666c4e3dcec8b3702 -s http://localhost:8080 --user admin --secret "$HEKETI_CLI_KEY"
> Device ea85942eaec73cb666c4e3dcec8b3702 deleted
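> (A device delete only succeeds once the device no longer holds any
> bricks; the node remove step above should have migrated them off.)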
>
>
> Delete the Node
>
> sh-4.4# heketi-cli node delete fb344a2ea889c7e25a772e747eeeec2a -s http://localhost:8080 --user admin --secret "$HEKETI_CLI_KEY"
> Node fb344a2ea889c7e25a772e747eeeec2a deleted
>
>
> Verify New Topology
>
> $ heketi-cli topology info
> make sure the new node and device are listed.
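> (You can also cross-check gluster's own view from any glusterfs pod,
> e.g.:)
>
> oc rsh <glusterfs-pod> gluster peer status
> oc rsh <glusterfs-pod> gluster volume info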
>
>
> Thanks,
>
> Todd
>
> # -----------------------
>
> Check that any existing PVCs are still accessible.
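> (e.g. check the claims are Bound and a mount still works:)
>
> oc get pvc --all-namespaces
> oc rsh <pod-using-a-gluster-pvc> ls <mount-path>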
> On 23/08/18 15:40, Tim Dudgeon <tdudgeon...@gmail.com> wrote:
>
> I have a 3 node containerised glusterfs setup, and one of the nodes
> has just died.
> I believe I can recover the disks that were used for the gluster
> storage.
> What is the best approach to replacing that node with a new one?
> Can I just create a new node with empty disks mounted and use the
> scaleup.yml playbook and [new_nodes] section, or should I be creating
> a node that re-uses the existing drives?
>
> Tim
>