Re: [ovirt-users] Ovirt vm's paused due to storage error
Found (and caused) my problem. I’d been evaluating different settings for (default settings shown):

    cluster.shd-max-threads    1
    cluster.shd-wait-qlength   1024

and had forgotten to reset them after testing. I had them at max-threads 8 and qlength 1. It worked, in that the cluster healed in approximately half the time, and it was a total failure, in that my cluster experienced IO pauses and at least one abnormal VM shutdown. I have 6-core processors in these boxes, and it looks like I simply overloaded them to the point that normal IO wasn’t being serviced because the self-heal was getting too much priority.

I’ve reverted to the defaults for these, and things are now behaving normally, with no pauses during healing at all. Moral of the story: don’t forget to undo testing settings when you’re done, and really don’t test extreme settings in production! Back to upgrading my test cluster so I can properly abuse things like this.
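For anyone backing out similar experiments, checking and reverting the options looks roughly like this (“data” here stands in for each affected volume; substitute your own volume names):

    gluster volume get data cluster.shd-max-threads      # show the current value
    gluster volume get data cluster.shd-wait-qlength
    gluster volume reset data cluster.shd-max-threads    # back to the default of 1
    gluster volume reset data cluster.shd-wait-qlength   # back to the default of 1024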
-Darrell

> From: Darrell Budic
> Subject: Re: [ovirt-users] Ovirt vm's paused due to storage error
> Date: March 22, 2018 at 1:23:29 PM CDT
> To: users
Re: [ovirt-users] Ovirt vm's paused due to storage error
I’ve also encountered something similar on my setup: ovirt 3.1.9 with a gluster 3.12.3 storage cluster. All the storage domains in question are set up as sharded gluster volumes, and I’ve enabled libgfapi support in the engine. It’s happened primarily to VMs that haven’t yet been restarted to switch to gfapi (these still have fuse mounts), but also to one or two VMs that have already been switched to gfapi mounts.

I started updating the storage cluster to gluster 3.12.6 yesterday and saw more annoying/bad behavior as well. Many of the “high disk use” VMs experienced hangs, though not as storage-related pauses. Instead, they hung and their watchdogs eventually reported CPU hangs. All did eventually resume normal operation, but it was annoying, to be sure. The oVirt engine also lost contact with all of my VMs (unknown status, “?” in the GUI), even though it still had contact with the hosts. My gluster cluster reported no errors, volume status was normal, and all peers and bricks were connected. I didn’t see anything in the gluster logs that indicated problems, though there were reports of failed heals that eventually went away.

It seems like something in vdsm and/or libgfapi isn’t handling the gfapi mounts well during healing and the related locks, but I can’t tell what it is. I’ve got two more servers in the cluster still to upgrade to 3.12.6, and I’ll keep an eye on more logs while I’m doing it; I’ll report back after I get more info.
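In case anyone wants to check their own VMs, the way I tell which mount type a running VM is using is to look at its disk definition in the domain XML (the VM name “myvm” is just an example):

    virsh -r dumpxml myvm | grep -A 3 "<disk"
    # gfapi: <disk type='network'> with <source protocol='gluster' name='volume/path'>
    # fuse:  <disk type='file'> with a path under /rhev/data-center/mnt/glusterSD/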
-Darrell

> From: Sahina Bose <sab...@redhat.com>
> Subject: Re: [ovirt-users] Ovirt vm's paused due to storage error
> Date: March 22, 2018 at 4:56:13 AM CDT
> To: Endre Karlson
> Cc: users
Re: [ovirt-users] Ovirt vm's paused due to storage error
Can you provide "gluster volume info" and the mount logs of the data volume (I assume this hosts the vdisks for the VMs with the storage error)?

Also vdsm.log at the corresponding time.
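(On a hyperconverged host these usually live at the paths below; the gluster mount log name is derived from the mount point, so the "<server>" part varies per setup:

    /var/log/vdsm/vdsm.log                                                 # vdsm log
    /var/log/glusterfs/rhev-data-center-mnt-glusterSD-<server>:_data.log   # fuse mount log for the data volume
)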
[ovirt-users] Ovirt vm's paused due to storage error
Hi, this issue is here again and we are getting several VMs going into storage error in our 4-node cluster running on CentOS 7.4 with gluster and ovirt 4.2.1.

Gluster version: 3.12.6

volume status:

[root@ovirt3 ~]# gluster volume status
Status of volume: data
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick ovirt0:/gluster/brick3/data           49152     0          Y       9102
Brick ovirt2:/gluster/brick3/data           49152     0          Y       28063
Brick ovirt3:/gluster/brick3/data           49152     0          Y       28379
Brick ovirt0:/gluster/brick4/data           49153     0          Y       9111
Brick ovirt2:/gluster/brick4/data           49153     0          Y       28069
Brick ovirt3:/gluster/brick4/data           49153     0          Y       28388
Brick ovirt0:/gluster/brick5/data           49154     0          Y       9120
Brick ovirt2:/gluster/brick5/data           49154     0          Y       28075
Brick ovirt3:/gluster/brick5/data           49154     0          Y       28397
Brick ovirt0:/gluster/brick6/data           49155     0          Y       9129
Brick ovirt2:/gluster/brick6_1/data         49155     0          Y       28081
Brick ovirt3:/gluster/brick6/data           49155     0          Y       28404
Brick ovirt0:/gluster/brick7/data           49156     0          Y       9138
Brick ovirt2:/gluster/brick7/data           49156     0          Y       28089
Brick ovirt3:/gluster/brick7/data           49156     0          Y       28411
Brick ovirt0:/gluster/brick8/data           49157     0          Y       9145
Brick ovirt2:/gluster/brick8/data           49157     0          Y       28095
Brick ovirt3:/gluster/brick8/data           49157     0          Y       28418
Brick ovirt1:/gluster/brick3/data           49152     0          Y       23139
Brick ovirt1:/gluster/brick4/data           49153     0          Y       23145
Brick ovirt1:/gluster/brick5/data           49154     0          Y       23152
Brick ovirt1:/gluster/brick6/data           49155     0          Y       23159
Brick ovirt1:/gluster/brick7/data           49156     0          Y       23166
Brick ovirt1:/gluster/brick8/data           49157     0          Y       23173
Self-heal Daemon on localhost               N/A       N/A        Y       7757
Bitrot Daemon on localhost                  N/A       N/A        Y       7766
Scrubber Daemon on localhost                N/A       N/A        Y       7785
Self-heal Daemon on ovirt2                  N/A       N/A        Y       8205
Bitrot Daemon on ovirt2                     N/A       N/A        Y       8216
Scrubber Daemon on ovirt2                   N/A       N/A        Y       8227
Self-heal Daemon on ovirt0                  N/A       N/A        Y       32665
Bitrot Daemon on ovirt0                     N/A       N/A        Y       32674
Scrubber Daemon on ovirt0                   N/A       N/A        Y       32712
Self-heal Daemon on ovirt1                  N/A       N/A        Y       31759
Bitrot Daemon on ovirt1                     N/A       N/A        Y       31768
Scrubber Daemon on ovirt1                   N/A       N/A        Y       31790

Task Status of Volume data
------------------------------------------------------------------------------
Task                 : Rebalance
ID                   : 62942ba3-db9e-4604-aa03-4970767f4d67
Status               : completed

Status of volume: engine
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick ovirt0:/gluster/brick1/engine         49158     0          Y       9155
Brick ovirt2:/gluster/brick1/engine         49158     0          Y       28107
Brick ovirt3:/gluster/brick1/engine         49158     0          Y       28427
Self-heal Daemon on localhost               N/A       N/A        Y       7757
Self-heal Daemon on ovirt1                  N/A       N/A        Y       31759
Self-heal Daemon on ovirt0                  N/A       N/A        Y       32665
Self-heal Daemon on ovirt2                  N/A       N/A        Y       8205

Task Status of Volume engine
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: iso
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick ovirt0:/gluster/brick2/iso            49159     0          Y       9164
Brick ovirt2:/gluster/brick2/iso            49159     0          Y       28116
Brick ovirt3:/gluster/brick2/iso            49159     0          Y       28436
NFS Server on localhost                     2049      0          Y       7746
Self-heal Daemon on localhost               N/A       N/A        Y       7757
NFS Server on ovirt1                        2049      0          Y       31748
Self-heal Daemon on ovirt1                  N/A       N/A        Y       31759
NFS Server on ovirt0                        2049      0          Y       32656
Self-heal Daemon on ovirt0                  N/A       N/A        Y       32665
NFS Server on ovir