Re: [ovirt-users] Ovirt vm's paused due to storage error

2018-03-30 Thread Darrell Budic
Found (and caused) my problem. 

I’d been evaluating different values for the following settings (defaults shown):
cluster.shd-max-threads     1
cluster.shd-wait-qlength    1024

and had forgotten to reset them after testing. I had them at max-thread 8 and 
qlength 1.
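
For reference, a quick way to spot forgotten test settings like these is to query the
options on the volume; a minimal sketch, assuming the volume is simply named "data":

  # show the currently effective value of each option on the volume
  gluster volume get data cluster.shd-max-threads
  gluster volume get data cluster.shd-wait-qlength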

It worked, in that the cluster healed in approximately half the time, and was a 
total failure, in that my cluster experienced IO pauses and at least one abnormal 
VM shutdown. 

I have 6-core processors in these boxes, and it looks like I just overloaded 
them to the point that normal IO wasn’t getting serviced because the self-heal 
was getting too much priority. I’ve reverted these to the defaults, and 
things are now behaving normally, with no pauses during healing at all.
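
Reverting is just a matter of resetting the options to their defaults, e.g.
(again assuming a volume named "data"):

  # restore the defaults shown above (1 and 1024 respectively)
  gluster volume reset data cluster.shd-max-threads
  gluster volume reset data cluster.shd-wait-qlength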

Moral of the story is don’t forget to undo testing settings when done, and 
really don’t test extreme settings in production!

Back to upgrading my test cluster so I can properly abuse things like this.

  -Darrell
> From: Darrell Budic 
> Subject: Re: [ovirt-users] Ovirt vm's paused due to storage error
> Date: March 22, 2018 at 1:23:29 PM CDT
> To: users
> 
> I’ve also encountered something similar on my setup, ovirt 3.1.9 with a gluster 
> 3.12.3 storage cluster. All the storage domains in question are set up as 
> sharded gluster volumes, and I’ve enabled libgfapi support in the engine. 
> It’s happened primarily to VMs that haven’t been restarted to switch to gfapi 
> yet (these still have fuse mounts), but also to one or two VMs that have 
> already been switched to gfapi mounts.
> 
> I started updating the storage cluster to gluster 3.12.6 yesterday and got 
> more annoying/bad behavior as well. Many “high disk use” VMs experienced 
> hangs, but not as storage-related pauses; instead, they hung and their 
> watchdogs eventually reported CPU hangs. All did eventually resume normal 
> operation, but it was annoying, to be sure. The oVirt Engine also lost 
> contact with all of my VMs (unknown status, ? in GUI), even though it still 
> had contact with the hosts. My gluster cluster reported no errors, volume 
> status was normal, and all peers and bricks were connected. I didn’t see 
> anything in the gluster logs that indicated problems, but there were reports 
> of failed heals that eventually went away. 
> 
> It seems like something in vdsm and/or libgfapi isn’t handling the gfapi 
> mounts well during healing and the related locks, but I can’t tell what it 
> is. I still have two more servers in the cluster to upgrade to 3.12.6, and 
> I’ll keep an eye on the logs while I’m doing it and report back once I have 
> more info.
> 
>   -Darrell
>> From: Sahina Bose <sab...@redhat.com>
>> Subject: Re: [ovirt-users] Ovirt vm's paused due to storage error
>> Date: March 22, 2018 at 4:56:13 AM CDT
>> To: Endre Karlson
>> Cc: users
>> 
>> Can you provide "gluster volume info" and the mount logs of the data volume 
>> (I assume this volume hosts the vdisks of the VMs with the storage error)?
>> 
>> Also vdsm.log at the corresponding time.
>> 
>> On Fri, Mar 16, 2018 at 3:45 AM, Endre Karlson <endre.karl...@gmail.com> wrote:
>> Hi, this is here again and we are getting several VMs going into storage 
>> error in our 4-node cluster running on CentOS 7.4 with Gluster and oVirt 
>> 4.2.1.
>> 
>> Gluster version: 3.12.6
>> 
>> volume status
>> [root@ovirt3 ~]# gluster volume status
>> Status of volume: data
>> Gluster process TCP Port  RDMA Port  Online  Pid
>> --
>> Brick ovirt0:/gluster/brick3/data   49152 0  Y   9102
>> Brick ovirt2:/gluster/brick3/data   49152 0  Y   28063
>> Brick ovirt3:/gluster/brick3/data   49152 0  Y   28379
>> Brick ovirt0:/gluster/brick4/data   49153 0  Y   9111
>> Brick ovirt2:/gluster/brick4/data   49153 0  Y   28069
>> Brick ovirt3:/gluster/brick4/data   49153 0  Y   28388
>> Brick ovirt0:/gluster/brick5/data   49154 0  Y   9120
>> Brick ovirt2:/gluster/brick5/data   49154 0  Y   28075
>> Brick ovirt3:/gluster/brick5/data   49154 0  Y   28397
>> Brick ovirt0:/gluster/brick6/data   49155 0  Y   9129
>> Brick ovirt2:/gluster/brick6_1/data 49155

Re: [ovirt-users] Ovirt vm's paused due to storage error

2018-03-22 Thread Darrell Budic
I’ve also encountered something similar on my setup, ovirt 3.1.9 with a gluster 
3.12.3 storage cluster. All the storage domains in question are set up as 
sharded gluster volumes, and I’ve enabled libgfapi support in the engine. 
It’s happened primarily to VMs that haven’t been restarted to switch to gfapi 
yet (these still have fuse mounts), but also to one or two VMs that have 
already been switched to gfapi mounts.
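
As an aside, a rough way to check which mount type a given running VM is actually
using is to look at its disk definition in libvirt on the host (read-only; <vm-name>
is a placeholder):

  # gfapi disks show <source protocol='gluster' .../>,
  # fuse-mounted disks show <source file='/rhev/data-center/...'/>
  virsh -r dumpxml <vm-name> | grep -E "source (protocol|file)"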

I started updating the storage cluster to gluster 3.12.6 yesterday and got more 
annoying/bad behavior as well. Many “high disk use” VMs experienced hangs, but 
not as storage-related pauses; instead, they hung and their watchdogs eventually 
reported CPU hangs. All did eventually resume normal operation, but it was 
annoying, to be sure. The oVirt Engine also lost contact with all of my VMs 
(unknown status, ? in GUI), even though it still had contact with the hosts. 
My gluster cluster reported no errors, volume status was normal, and all peers 
and bricks were connected. I didn’t see anything in the gluster logs that 
indicated problems, but there were reports of failed heals that eventually 
went away. 
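
For what it’s worth, the heal backlog can be watched while an upgrade like this is
in progress with something along these lines (assuming the volume is named "data"):

  gluster volume heal data info                      # entries still pending heal, per brick
  gluster volume heal data statistics heal-count     # just the per-brick counts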

It seems like something in vdsm and/or libgfapi isn’t handling the gfapi mounts 
well during healing and the related locks, but I can’t tell what it is. I still 
have two more servers in the cluster to upgrade to 3.12.6, and I’ll keep an eye 
on the logs while I’m doing it and report back once I have more info.

  -Darrell
> From: Sahina Bose 
> Subject: Re: [ovirt-users] Ovirt vm's paused due to storage error
> Date: March 22, 2018 at 4:56:13 AM CDT
> To: Endre Karlson
> Cc: users
> 
> Can you provide "gluster volume info" and the mount logs of the data volume 
> (I assume this volume hosts the vdisks of the VMs with the storage error)?
> 
> Also vdsm.log at the corresponding time.
> 
> On Fri, Mar 16, 2018 at 3:45 AM, Endre Karlson <endre.karl...@gmail.com> wrote:
> Hi, this is here again and we are getting several VMs going into storage 
> error in our 4-node cluster running on CentOS 7.4 with Gluster and oVirt 
> 4.2.1.
> 
> Gluster version: 3.12.6
> 
> volume status
> [root@ovirt3 ~]# gluster volume status
> Status of volume: data
> Gluster process TCP Port  RDMA Port  Online  Pid
> --
> Brick ovirt0:/gluster/brick3/data   49152 0  Y   9102 
> Brick ovirt2:/gluster/brick3/data   49152 0  Y   28063
> Brick ovirt3:/gluster/brick3/data   49152 0  Y   28379
> Brick ovirt0:/gluster/brick4/data   49153 0  Y   9111 
> Brick ovirt2:/gluster/brick4/data   49153 0  Y   28069
> Brick ovirt3:/gluster/brick4/data   49153 0  Y   28388
> Brick ovirt0:/gluster/brick5/data   49154 0  Y   9120 
> Brick ovirt2:/gluster/brick5/data   49154 0  Y   28075
> Brick ovirt3:/gluster/brick5/data   49154 0  Y   28397
> Brick ovirt0:/gluster/brick6/data   49155 0  Y   9129 
> Brick ovirt2:/gluster/brick6_1/data 49155 0  Y   28081
> Brick ovirt3:/gluster/brick6/data   49155 0  Y   28404
> Brick ovirt0:/gluster/brick7/data   49156 0  Y   9138 
> Brick ovirt2:/gluster/brick7/data   49156 0  Y   28089
> Brick ovirt3:/gluster/brick7/data   49156 0  Y   28411
> Brick ovirt0:/gluster/brick8/data   49157 0  Y   9145 
> Brick ovirt2:/gluster/brick8/data   49157 0  Y   28095
> Brick ovirt3:/gluster/brick8/data   49157 0  Y   28418
> Brick ovirt1:/gluster/brick3/data   49152 0  Y   23139
> Brick ovirt1:/gluster/brick4/data   49153 0  Y   23145
> Brick ovirt1:/gluster/brick5/data   49154 0  Y   23152
> Brick ovirt1:/gluster/brick6/data   49155 0  Y   23159
> Brick ovirt1:/gluster/brick7/data   49156 0  Y   23166
> Brick ovirt1:/gluster/brick8/data   49157 0  Y   23173
> Self-heal Daemon on localhost   N/A   N/AY   7757 
> Bitrot Daemon on localhost  N/A   N/AY   7766 
> Scrubber Daemon on localhostN/A   N/AY   7785 
> Self-heal Daemon on ovirt2  N/A   N/AY   8205 
> Bitrot Daemon on ovirt2 N/A   N/AY   8216 
> Scrubber Daemon on ovirt2   N/A   N/AY 

Re: [ovirt-users] Ovirt vm's paused due to storage error

2018-03-22 Thread Sahina Bose
Can you provide "gluster volume info" and the mount logs of the data
volume (I assume this volume hosts the vdisks of the VMs with the storage
error)?

Also vdsm.log at the corresponding time.
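
These are typically in the default locations on the host, e.g.:

  gluster volume info data                                     # volume configuration
  ls /var/log/glusterfs/rhev-data-center-mnt-glusterSD-*.log   # fuse mount log(s) of the gluster storage domains
  less /var/log/vdsm/vdsm.log                                  # vdsm log; check around the time of the pauses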

On Fri, Mar 16, 2018 at 3:45 AM, Endre Karlson wrote:

> Hi, this is here again and we are getting several VMs going into
> storage error in our 4-node cluster running on CentOS 7.4 with Gluster and
> oVirt 4.2.1.
>
> Gluster version: 3.12.6
>
> volume status
> [root@ovirt3 ~]# gluster volume status
> Status of volume: data
> Gluster process TCP Port  RDMA Port  Online  Pid
> --
> Brick ovirt0:/gluster/brick3/data   49152 0  Y   9102
> Brick ovirt2:/gluster/brick3/data   49152 0  Y   28063
> Brick ovirt3:/gluster/brick3/data   49152 0  Y   28379
> Brick ovirt0:/gluster/brick4/data   49153 0  Y   9111
> Brick ovirt2:/gluster/brick4/data   49153 0  Y   28069
> Brick ovirt3:/gluster/brick4/data   49153 0  Y   28388
> Brick ovirt0:/gluster/brick5/data   49154 0  Y   9120
> Brick ovirt2:/gluster/brick5/data   49154 0  Y   28075
> Brick ovirt3:/gluster/brick5/data   49154 0  Y   28397
> Brick ovirt0:/gluster/brick6/data   49155 0  Y   9129
> Brick ovirt2:/gluster/brick6_1/data 49155 0  Y   28081
> Brick ovirt3:/gluster/brick6/data   49155 0  Y   28404
> Brick ovirt0:/gluster/brick7/data   49156 0  Y   9138
> Brick ovirt2:/gluster/brick7/data   49156 0  Y   28089
> Brick ovirt3:/gluster/brick7/data   49156 0  Y   28411
> Brick ovirt0:/gluster/brick8/data   49157 0  Y   9145
> Brick ovirt2:/gluster/brick8/data   49157 0  Y   28095
> Brick ovirt3:/gluster/brick8/data   49157 0  Y   28418
> Brick ovirt1:/gluster/brick3/data   49152 0  Y   23139
> Brick ovirt1:/gluster/brick4/data   49153 0  Y   23145
> Brick ovirt1:/gluster/brick5/data   49154 0  Y   23152
> Brick ovirt1:/gluster/brick6/data   49155 0  Y   23159
> Brick ovirt1:/gluster/brick7/data   49156 0  Y   23166
> Brick ovirt1:/gluster/brick8/data   49157 0  Y   23173
> Self-heal Daemon on localhost   N/A   N/AY   7757
> Bitrot Daemon on localhost  N/A   N/AY   7766
> Scrubber Daemon on localhostN/A   N/AY   7785
> Self-heal Daemon on ovirt2  N/A   N/AY   8205
> Bitrot Daemon on ovirt2 N/A   N/AY   8216
> Scrubber Daemon on ovirt2   N/A   N/AY   8227
> Self-heal Daemon on ovirt0  N/A   N/AY   32665
> Bitrot Daemon on ovirt0 N/A   N/AY   32674
> Scrubber Daemon on ovirt0   N/A   N/AY   32712
> Self-heal Daemon on ovirt1  N/A   N/AY   31759
> Bitrot Daemon on ovirt1 N/A   N/AY   31768
> Scrubber Daemon on ovirt1   N/A   N/AY   31790
> 
> Task Status of Volume data
> --
> Task : Rebalance
> ID   : 62942ba3-db9e-4604-aa03-4970767f4d67
> Status   : completed
> 
> Status of volume: engine
> Gluster process TCP Port  RDMA Port  Online  Pid
> --
> Brick ovirt0:/gluster/brick1/engine 49158 0  Y   9155
> Brick ovirt2:/gluster/brick1/engine 49158 0  Y   28107
> Brick ovirt3:/gluster/brick1/engine 49158 0  Y   28427
> Self-heal Daemon on localhost   N/A   N/AY   7757
> Self-heal Daemon on ovirt1  N/A   N/AY   31759
> Self-heal Daemon on ovirt0  N/A   N/AY   32665
> Self-heal Daemon on ovirt2  N/A   N/AY   8205
> 
> Task Status of Volume engine
> --
> There are no active volume tasks
> 
> Status of volume: iso
> Gluster process TCP Port  RDMA Port  Online  Pid
> --
> Brick ovirt0:/gluster/brick2/iso49159 0  Y   9164
> Brick ovirt2:/gluster/brick2/iso49159 0  Y   28116
> Brick ovirt3:/gluster/brick2/iso

[ovirt-users] Ovirt vm's paused due to storage error

2018-03-15 Thread Endre Karlson
Hi, this is here again and we are getting several VMs going into
storage error in our 4-node cluster running on CentOS 7.4 with Gluster and
oVirt 4.2.1.

Gluster version: 3.12.6

volume status
[root@ovirt3 ~]# gluster volume status
Status of volume: data
Gluster process TCP Port  RDMA Port  Online  Pid
--
Brick ovirt0:/gluster/brick3/data   49152 0  Y   9102
Brick ovirt2:/gluster/brick3/data   49152 0  Y   28063
Brick ovirt3:/gluster/brick3/data   49152 0  Y   28379
Brick ovirt0:/gluster/brick4/data   49153 0  Y   9111
Brick ovirt2:/gluster/brick4/data   49153 0  Y   28069
Brick ovirt3:/gluster/brick4/data   49153 0  Y   28388
Brick ovirt0:/gluster/brick5/data   49154 0  Y   9120
Brick ovirt2:/gluster/brick5/data   49154 0  Y   28075
Brick ovirt3:/gluster/brick5/data   49154 0  Y   28397
Brick ovirt0:/gluster/brick6/data   49155 0  Y   9129
Brick ovirt2:/gluster/brick6_1/data 49155 0  Y   28081
Brick ovirt3:/gluster/brick6/data   49155 0  Y   28404
Brick ovirt0:/gluster/brick7/data   49156 0  Y   9138
Brick ovirt2:/gluster/brick7/data   49156 0  Y   28089
Brick ovirt3:/gluster/brick7/data   49156 0  Y   28411
Brick ovirt0:/gluster/brick8/data   49157 0  Y   9145
Brick ovirt2:/gluster/brick8/data   49157 0  Y   28095
Brick ovirt3:/gluster/brick8/data   49157 0  Y   28418
Brick ovirt1:/gluster/brick3/data   49152 0  Y   23139
Brick ovirt1:/gluster/brick4/data   49153 0  Y   23145
Brick ovirt1:/gluster/brick5/data   49154 0  Y   23152
Brick ovirt1:/gluster/brick6/data   49155 0  Y   23159
Brick ovirt1:/gluster/brick7/data   49156 0  Y   23166
Brick ovirt1:/gluster/brick8/data   49157 0  Y   23173
Self-heal Daemon on localhost   N/A   N/AY   7757
Bitrot Daemon on localhost  N/A   N/AY   7766
Scrubber Daemon on localhostN/A   N/AY   7785
Self-heal Daemon on ovirt2  N/A   N/AY   8205
Bitrot Daemon on ovirt2 N/A   N/AY   8216
Scrubber Daemon on ovirt2   N/A   N/AY   8227
Self-heal Daemon on ovirt0  N/A   N/AY   32665
Bitrot Daemon on ovirt0 N/A   N/AY   32674
Scrubber Daemon on ovirt0   N/A   N/AY   32712
Self-heal Daemon on ovirt1  N/A   N/AY   31759
Bitrot Daemon on ovirt1 N/A   N/AY   31768
Scrubber Daemon on ovirt1   N/A   N/AY   31790

Task Status of Volume data
--
Task : Rebalance
ID   : 62942ba3-db9e-4604-aa03-4970767f4d67
Status   : completed

Status of volume: engine
Gluster process TCP Port  RDMA Port  Online  Pid
--
Brick ovirt0:/gluster/brick1/engine 49158 0  Y   9155
Brick ovirt2:/gluster/brick1/engine 49158 0  Y   28107
Brick ovirt3:/gluster/brick1/engine 49158 0  Y   28427
Self-heal Daemon on localhost   N/A   N/AY   7757
Self-heal Daemon on ovirt1  N/A   N/AY   31759
Self-heal Daemon on ovirt0  N/A   N/AY   32665
Self-heal Daemon on ovirt2  N/A   N/AY   8205

Task Status of Volume engine
--
There are no active volume tasks

Status of volume: iso
Gluster process TCP Port  RDMA Port  Online  Pid
--
Brick ovirt0:/gluster/brick2/iso49159 0  Y   9164
Brick ovirt2:/gluster/brick2/iso49159 0  Y   28116
Brick ovirt3:/gluster/brick2/iso49159 0  Y   28436
NFS Server on localhost 2049  0  Y   7746
Self-heal Daemon on localhost   N/A   N/AY   7757
NFS Server on ovirt12049  0  Y   31748
Self-heal Daemon on ovirt1  N/A   N/AY   31759
NFS Server on ovirt02049  0  Y   32656
Self-heal Daemon on ovirt0  N/A   N/AY   32665
NFS Server on ovir