Re: [ovirt-users] VMs freezing during heals

2015-04-06 Thread Darrell Budic
I hadn’t revisited it yet, but it is possible to use cgroups to limit
glusterfs’s CPU usage, which might help you out.

Andrew Klau has a blog post about it: 
http://www.andrewklau.com/controlling-glusterfsd-cpu-outbreaks-with-cgroups/

Be careful about how far you throttle it down: if it’s your VM’s disk that is
being rebuilt, I’d expect you’ll end up pausing the VM anyway.
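
For anyone who wants to try it without digging through the post, here’s a rough
sketch of the cgroup (v1) approach, assuming the libcgroup tools
(cgcreate/cgset/cgclassify) are installed; the group name and quota values are
only examples to tune for your own hardware:

  # create a cpu cgroup for the gluster brick daemons (name is arbitrary)
  cgcreate -g cpu:/glusterlimit

  # cap the group at roughly half of one core per 100 ms scheduling period
  cgset -r cpu.cfs_period_us=100000 glusterlimit
  cgset -r cpu.cfs_quota_us=50000 glusterlimit

  # move the running brick processes into the group
  for pid in $(pidof glusterfsd); do
      cgclassify -g cpu:/glusterlimit "$pid"
  done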

 On Apr 4, 2015, at 8:57 AM, Jorick Astrego j.astr...@netbulae.eu wrote:
 
 
 
 On 04/03/2015 10:04 PM, Alastair Neil wrote:
 Any follow up on this?
 
 Are there known issues using a replica 3 gluster datastore with lvm 
 thin-provisioned bricks?
 
 On 20 March 2015 at 15:22, Alastair Neil ajneil.t...@gmail.com wrote:
 CentOS 6.6
  
  vdsm-4.16.10-8.gitc937927.el6
 glusterfs-3.6.2-1.el6
 2.6.32 - 504.8.1.el6.x86_64
 
 I moved to 3.6 specifically to get the snapshotting feature, hence my desire 
 to migrate to thinly provisioned lvm bricks.
 
 
 Well, on the glusterfs mailing list there have been discussions:
 
 
 3.6.2 is a major release and introduces some new features at the cluster-wide 
 level. Additionally, it is not stable yet.
 
 
 
 
 
 
 On 20 March 2015 at 14:57, Darrell Budic bu...@onholyground.com wrote:
 What version of gluster are you running on these?
 
 I’ve seen high load during heals bounce my hosted engine around due to 
 overall system load, but never pause anything else. Cent 7 combo 
 storage/host systems, gluster 3.5.2.
 
 
 On Mar 20, 2015, at 9:57 AM, Alastair Neil ajneil.t...@gmail.com wrote:
 
 Pranith
 
 I have run a pretty straightforward test.  I created a two-brick 50 GB 
 replica volume with normal lvm bricks, and installed two servers, one 
 CentOS 6.6 and one CentOS 7.0.  I kicked off bonnie++ on both to generate 
 some file system activity and then made the volume replica 3.  I saw no 
 issues on the servers.
 
 It's not clear whether this is a sufficiently rigorous test, and the volume I 
 have had issues on is a 3 TB volume with about 2 TB used.
 
 -Alastair
 
 
 On 19 March 2015 at 12:30, Alastair Neil ajneil.t...@gmail.com wrote:
 I don't think I have the resources to test it meaningfully.  I have about 
 50 vms on my primary storage domain.  I might be able to set up a small 50 
 GB volume and provision 2 or 3 vms running test loads but I'm not sure it 
 would be comparable.  I'll give it a try and let you know if I see similar 
 behaviour.
 
 On 19 March 2015 at 11:34, Pranith Kumar Karampuri pkara...@redhat.com wrote:
 Without thinly provisioned lvm.
 
 Pranith
 
 On 03/19/2015 08:01 PM, Alastair Neil wrote:
 do you mean raw partitions as bricks, or simply without thin-provisioned 
 lvm?
 
 
 
 On 19 March 2015 at 00:32, Pranith Kumar Karampuri pkara...@redhat.com wrote:
 Could you let me know if you see this problem without lvm as well?
 
 Pranith
 
 On 03/18/2015 08:25 PM, Alastair Neil wrote:
 I am in the process of replacing the bricks with thinly provisioned lvs 
 yes.
 
 
 
 On 18 March 2015 at 09:35, Pranith Kumar Karampuri pkara...@redhat.com wrote:
 hi,
   Are you using a thin-LVM-based backend on which the bricks are 
 created?
 
 Pranith
 
 On 03/18/2015 02:05 AM, Alastair Neil wrote:
 I have an oVirt cluster with 6 VM hosts and 4 gluster nodes. There are 
 two virtualisation clusters, one with two Nehalem nodes and one with 
 four Sandy Bridge nodes. My master storage domain is a GlusterFS domain 
 backed by a replica 3 gluster volume from 3 of the gluster nodes.  The 
 engine is a hosted engine 3.5.1 on 3 of the Sandy Bridge nodes, with 
 storage provided by NFS from a different gluster volume.  All the hosts 
 are CentOS 6.6.
 
  vdsm-4.16.10-8.gitc937927.el6
 glusterfs-3.6.2-1.el6
 2.6.32 - 504.8.1.el6.x86_64
 
 Problems happen when I try to add a new brick or replace a brick: 
 eventually the self-heal will kill the VMs. In the VMs' logs I see 
 kernel hung-task messages.
 

Re: [ovirt-users] VMs freezing during heals

2015-04-04 Thread Jorick Astrego


On 04/03/2015 10:04 PM, Alastair Neil wrote:
 Any follow up on this?

  Are there known issues using a replica 3 gluster datastore with lvm
 thin-provisioned bricks?

 On 20 March 2015 at 15:22, Alastair Neil ajneil.t...@gmail.com wrote:

 CentOS 6.6
  

  vdsm-4.16.10-8.gitc937927.el6
 glusterfs-3.6.2-1.el6
 2.6.32 - 504.8.1.el6.x86_64


 I moved to 3.6 specifically to get the snapshotting feature, hence
 my desire to migrate to thinly provisioned lvm bricks.



Well, on the glusterfs mailing list there have been discussions:


 3.6.2 is a major release and introduces some new features at the
 cluster-wide level. Additionally, it is not stable yet.






 On 20 March 2015 at 14:57, Darrell Budic bu...@onholyground.com wrote:

 What version of gluster are you running on these?

 I’ve seen high load during heals bounce my hosted engine
 around due to overall system load, but never pause anything
 else. Cent 7 combo storage/host systems, gluster 3.5.2.


 On Mar 20, 2015, at 9:57 AM, Alastair Neil ajneil.t...@gmail.com wrote:

 Pranith

 I have run a pretty straightforward test.  I created a two
 brick 50 G replica volume with normal lvm bricks, and
 installed two servers, one centos 6.6 and one centos 7.0.  I
 kicked off bonnie++ on both to generate some file system
 activity and then made the volume replica 3.  I saw no issues
 on the servers.   

 It's not clear whether this is a sufficiently rigorous test, and the
 volume I have had issues on is a 3 TB volume with about 2 TB used.

 -Alastair


 On 19 March 2015 at 12:30, Alastair Neil ajneil.t...@gmail.com wrote:

 I don't think I have the resources to test it
 meaningfully.  I have about 50 vms on my primary storage
 domain.  I might be able to set up a small 50 GB volume
 and provision 2 or 3 vms running test loads but I'm not
 sure it would be comparable.  I'll give it a try and let
 you know if I see similar behaviour.

 On 19 March 2015 at 11:34, Pranith Kumar Karampuri pkara...@redhat.com wrote:

 Without thinly provisioned lvm.

 Pranith

 On 03/19/2015 08:01 PM, Alastair Neil wrote:
 do you mean raw partitions as bricks, or simply without
 thin-provisioned lvm?



 On 19 March 2015 at 00:32, Pranith Kumar Karampuri pkara...@redhat.com
 wrote:

 Could you let me know if you see this problem
 without lvm as well?

 Pranith

 On 03/18/2015 08:25 PM, Alastair Neil wrote:
 I am in the process of replacing the bricks
 with thinly provisioned lvs yes.



 On 18 March 2015 at 09:35, Pranith Kumar Karampuri
 pkara...@redhat.com wrote:

 hi,
   Are you using a thin-LVM-based backend
 on which the bricks are created?

 Pranith

 On 03/18/2015 02:05 AM, Alastair Neil wrote:
 I have an oVirt cluster with 6 VM hosts and
 4 gluster nodes. There are two
 virtualisation clusters, one with two
 Nehalem nodes and one with four
 Sandy Bridge nodes. My master storage
 domain is a GlusterFS domain backed by a
 replica 3 gluster volume from 3 of the
 gluster nodes. The engine is a hosted
 engine 3.5.1 on 3 of the Sandy Bridge
 nodes, with storage provided by NFS from
 a different gluster volume. All the hosts
 are CentOS 6.6.

  vdsm-4.16.10-8.gitc937927.el6
 glusterfs-3.6.2-1.el6
 2.6.32 - 504.8.1.el6.x86_64


 Problems happen when I try to add a new
 brick or replace a brick: eventually the
 self-heal will kill the VMs. In the VMs'
 logs I see kernel hung-task messages.


Re: [ovirt-users] VMs freezing during heals

2015-04-03 Thread Alastair Neil
Any follow up on this?

 Are there known issues using a replica 3 gluster datastore with lvm
thin-provisioned bricks?

On 20 March 2015 at 15:22, Alastair Neil ajneil.t...@gmail.com wrote:

 CentOS 6.6


  vdsm-4.16.10-8.gitc937927.el6
 glusterfs-3.6.2-1.el6
 2.6.32 - 504.8.1.el6.x86_64


 I moved to 3.6 specifically to get the snapshotting feature, hence my desire
 to migrate to thinly provisioned lvm bricks.




 On 20 March 2015 at 14:57, Darrell Budic bu...@onholyground.com wrote:

 What version of gluster are you running on these?

 I’ve seen high load during heals bounce my hosted engine around due to
 overall system load, but never pause anything else. Cent 7 combo
 storage/host systems, gluster 3.5.2.


 On Mar 20, 2015, at 9:57 AM, Alastair Neil ajneil.t...@gmail.com wrote:

 Pranith

 I have run a pretty straightforward test.  I created a two brick 50 G
 replica volume with normal lvm bricks, and installed two servers, one
 centos 6.6 and one centos 7.0.  I kicked off bonnie++ on both to generate
 some file system activity and then made the volume replica 3.  I saw no
 issues on the servers.

 It's not clear whether this is a sufficiently rigorous test, and the volume I
 have had issues on is a 3 TB volume with about 2 TB used.

 -Alastair


 On 19 March 2015 at 12:30, Alastair Neil ajneil.t...@gmail.com wrote:

 I don't think I have the resources to test it meaningfully.  I have
 about 50 vms on my primary storage domain.  I might be able to set up a
 small 50 GB volume and provision 2 or 3 vms running test loads but I'm not
 sure it would be comparable.  I'll give it a try and let you know if I see
 similar behaviour.

 On 19 March 2015 at 11:34, Pranith Kumar Karampuri pkara...@redhat.com
 wrote:

  Without thinly provisioned lvm.

 Pranith

 On 03/19/2015 08:01 PM, Alastair Neil wrote:

 do you mean raw partitions as bricks, or simply without
 thin-provisioned lvm?



 On 19 March 2015 at 00:32, Pranith Kumar Karampuri pkara...@redhat.com
  wrote:

  Could you let me know if you see this problem without lvm as well?

 Pranith

 On 03/18/2015 08:25 PM, Alastair Neil wrote:

 I am in the process of replacing the bricks with thinly provisioned
 lvs yes.



 On 18 March 2015 at 09:35, Pranith Kumar Karampuri 
 pkara...@redhat.com wrote:

  hi,
   Are you using a thin-LVM-based backend on which the bricks are
 created?

 Pranith

 On 03/18/2015 02:05 AM, Alastair Neil wrote:

  I have an oVirt cluster with 6 VM hosts and 4 gluster nodes. There
 are two virtualisation clusters, one with two Nehalem nodes and one with
 four Sandy Bridge nodes. My master storage domain is a GlusterFS domain
 backed by a replica 3 gluster volume from 3 of the gluster nodes.  The
 engine is a hosted engine 3.5.1 on 3 of the Sandy Bridge nodes, with
 storage provided by NFS from a different gluster volume.  All the hosts
 are CentOS 6.6.

   vdsm-4.16.10-8.gitc937927.el6
 glusterfs-3.6.2-1.el6
 2.6.32 - 504.8.1.el6.x86_64


  Problems happen when I try to add a new brick or replace a brick:
 eventually the self-heal will kill the VMs. In the VMs' logs I see kernel
 hung-task messages.


Re: [ovirt-users] VMs freezing during heals

2015-03-20 Thread Alastair Neil
Pranith

I have run a pretty straightforward test.  I created a two-brick 50 GB
replica volume with normal lvm bricks, and installed two servers, one
CentOS 6.6 and one CentOS 7.0.  I kicked off bonnie++ on both to generate
some file system activity and then made the volume replica 3.  I saw no
issues on the servers.
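
For reference, a sketch of how such a test can be driven from the gluster CLI
(not necessarily exactly what I ran; the volume name, hostnames and brick paths
are placeholders):

  # create and start the initial two-brick replica 2 volume
  gluster volume create testvol replica 2 \
      server1:/bricks/testvol/brick server2:/bricks/testvol/brick
  gluster volume start testvol

  # later, grow it to replica 3 by adding a third brick; this is what
  # triggers the self-heal that populates the new copy
  gluster volume add-brick testvol replica 3 server3:/bricks/testvol/brick

  # watch the heal backlog while the clients keep writing
  gluster volume heal testvol info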

It's not clear whether this is a sufficiently rigorous test, and the volume I
have had issues on is a 3 TB volume with about 2 TB used.

-Alastair


On 19 March 2015 at 12:30, Alastair Neil ajneil.t...@gmail.com wrote:

 I don't think I have the resources to test it meaningfully.  I have about
 50 vms on my primary storage domain.  I might be able to set up a small 50
 GB volume and provision 2 or 3 vms running test loads but I'm not sure it
 would be comparable.  I'll give it a try and let you know if I see similar
 behaviour.

 On 19 March 2015 at 11:34, Pranith Kumar Karampuri pkara...@redhat.com
 wrote:

  Without thinly provisioned lvm.

 Pranith

 On 03/19/2015 08:01 PM, Alastair Neil wrote:

 do you mean raw partitions as bricks, or simply without thin-provisioned
 lvm?



 On 19 March 2015 at 00:32, Pranith Kumar Karampuri pkara...@redhat.com
 wrote:

  Could you let me know if you see this problem without lvm as well?

 Pranith

 On 03/18/2015 08:25 PM, Alastair Neil wrote:

 I am in the process of replacing the bricks with thinly provisioned lvs
 yes.



 On 18 March 2015 at 09:35, Pranith Kumar Karampuri pkara...@redhat.com
 wrote:

  hi,
   Are you using a thin-LVM-based backend on which the bricks are
 created?

 Pranith

 On 03/18/2015 02:05 AM, Alastair Neil wrote:

  I have an oVirt cluster with 6 VM hosts and 4 gluster nodes. There are
 two virtualisation clusters, one with two Nehalem nodes and one with four
 Sandy Bridge nodes. My master storage domain is a GlusterFS domain backed by a
 replica 3 gluster volume from 3 of the gluster nodes.  The engine is a
 hosted engine 3.5.1 on 3 of the Sandy Bridge nodes, with storage provided by
 NFS from a different gluster volume.  All the hosts are CentOS 6.6.

   vdsm-4.16.10-8.gitc937927.el6
 glusterfs-3.6.2-1.el6
 2.6.32 - 504.8.1.el6.x86_64


  Problems happen when I try to add a new brick or replace a brick:
 eventually the self-heal will kill the VMs. In the VMs' logs I see kernel
 hung-task messages.


Re: [ovirt-users] VMs freezing during heals

2015-03-20 Thread Alastair Neil
CentOS 6.6


  vdsm-4.16.10-8.gitc937927.el6
 glusterfs-3.6.2-1.el6
 2.6.32 - 504.8.1.el6.x86_64


I moved to 3.6 specifically to get the snapshotting feature, hence my desire
to migrate to thinly provisioned lvm bricks.
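
In case it is useful to anyone planning the same migration, the bricks can be
carved out of an LVM thin pool roughly like this (device, VG, sizes and mount
point are only examples, not my actual layout):

  pvcreate /dev/sdb
  vgcreate vg_bricks /dev/sdb

  # one thin pool, then a thin (virtually sized) LV for the brick itself
  lvcreate -L 900G -T vg_bricks/brickpool
  lvcreate -V 1T -T vg_bricks/brickpool -n brick1

  # the XFS inode size usually recommended for gluster bricks, then mount it
  mkfs.xfs -i size=512 /dev/vg_bricks/brick1
  mkdir -p /bricks/brick1
  mount /dev/vg_bricks/brick1 /bricks/brick1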




On 20 March 2015 at 14:57, Darrell Budic bu...@onholyground.com wrote:

 What version of gluster are you running on these?

 I’ve seen high load during heals bounce my hosted engine around due to
 overall system load, but never pause anything else. Cent 7 combo
 storage/host systems, gluster 3.5.2.


 On Mar 20, 2015, at 9:57 AM, Alastair Neil ajneil.t...@gmail.com wrote:

 Pranith

 I have run a pretty straightforward test.  I created a two brick 50 G
 replica volume with normal lvm bricks, and installed two servers, one
 centos 6.6 and one centos 7.0.  I kicked off bonnie++ on both to generate
 some file system activity and then made the volume replica 3.  I saw no
 issues on the servers.

 It's not clear whether this is a sufficiently rigorous test, and the volume I
 have had issues on is a 3 TB volume with about 2 TB used.

 -Alastair


 On 19 March 2015 at 12:30, Alastair Neil ajneil.t...@gmail.com wrote:

 I don't think I have the resources to test it meaningfully.  I have about
 50 vms on my primary storage domain.  I might be able to set up a small 50
 GB volume and provision 2 or 3 vms running test loads but I'm not sure it
 would be comparable.  I'll give it a try and let you know if I see similar
 behaviour.

 On 19 March 2015 at 11:34, Pranith Kumar Karampuri pkara...@redhat.com
 wrote:

  Without thinly provisioned lvm.

 Pranith

 On 03/19/2015 08:01 PM, Alastair Neil wrote:

 do you mean raw partitions as bricks, or simply without thin-provisioned
 lvm?



 On 19 March 2015 at 00:32, Pranith Kumar Karampuri pkara...@redhat.com
 wrote:

  Could you let me know if you see this problem without lvm as well?

 Pranith

 On 03/18/2015 08:25 PM, Alastair Neil wrote:

 I am in the process of replacing the bricks with thinly provisioned lvs
 yes.



 On 18 March 2015 at 09:35, Pranith Kumar Karampuri pkara...@redhat.com
  wrote:

  hi,
   Are you using a thin-LVM-based backend on which the bricks are
 created?

 Pranith

 On 03/18/2015 02:05 AM, Alastair Neil wrote:

  I have an oVirt cluster with 6 VM hosts and 4 gluster nodes. There
 are two virtualisation clusters, one with two Nehalem nodes and one with
 four Sandy Bridge nodes. My master storage domain is a GlusterFS domain
 backed by a replica 3 gluster volume from 3 of the gluster nodes.  The
 engine is a hosted engine 3.5.1 on 3 of the Sandy Bridge nodes, with
 storage provided by NFS from a different gluster volume.  All the hosts
 are CentOS 6.6.

   vdsm-4.16.10-8.gitc937927.el6
 glusterfs-3.6.2-1.el6
 2.6.32 - 504.8.1.el6.x86_64


  Problems happen when I try to add a new brick or replace a brick:
 eventually the self-heal will kill the VMs. In the VMs' logs I see kernel
 hung-task messages.


Re: [ovirt-users] VMs freezing during heals

2015-03-18 Thread Pranith Kumar Karampuri

hi,
  Are you using a thin-LVM-based backend on which the bricks are created?
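
If you are not sure, one quick way to check is to look at the LV attributes of
the devices backing the bricks; the column list here is just a suggestion:

  # thin LVs show 'V' as the first lv_attr character and list their pool;
  # thin pools show 't'
  lvs -o lv_name,vg_name,lv_attr,pool_lv,data_percent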

Pranith
On 03/18/2015 02:05 AM, Alastair Neil wrote:
I have an oVirt cluster with 6 VM hosts and 4 gluster nodes. There are 
two virtualisation clusters, one with two Nehalem nodes and one with 
four Sandy Bridge nodes. My master storage domain is a GlusterFS domain 
backed by a replica 3 gluster volume from 3 of the gluster nodes.  The 
engine is a hosted engine 3.5.1 on 3 of the Sandy Bridge nodes, with 
storage provided by NFS from a different gluster volume.  All the 
hosts are CentOS 6.6.


 vdsm-4.16.10-8.gitc937927.el6
glusterfs-3.6.2-1.el6
2.6.32 - 504.8.1.el6.x86_64


Problems happen when I try to add a new brick or replace a brick: 
eventually the self-heal will kill the VMs. In the VMs' logs I see 
kernel hung-task messages.


Mar 12 23:05:16 static1 kernel: INFO: task nginx:1736 blocked for
more than 120 seconds.
Mar 12 23:05:16 static1 kernel:  Not tainted
2.6.32-504.3.3.el6.x86_64 #1
Mar 12 23:05:16 static1 kernel: echo 0 >
/proc/sys/kernel/hung_task_timeout_secs disables this message.
Mar 12 23:05:16 static1 kernel: nginx D 0001  
  0  1736   1735 0x0080

Mar 12 23:05:16 static1 kernel: 8800778b17a8 0082
 000126c0
Mar 12 23:05:16 static1 kernel: 88007e5c6500 880037170080
0006ce5c85bd9185 88007e5c64d0
Mar 12 23:05:16 static1 kernel: 88007a614ae0 0001722b64ba
88007a615098 8800778b1fd8
Mar 12 23:05:16 static1 kernel: Call Trace:
Mar 12 23:05:16 static1 kernel: [8152a885]
schedule_timeout+0x215/0x2e0
Mar 12 23:05:16 static1 kernel: [8152a503]
wait_for_common+0x123/0x180
Mar 12 23:05:16 static1 kernel: [81064b90] ?
default_wake_function+0x0/0x20
Mar 12 23:05:16 static1 kernel: [a0210a76] ?
_xfs_buf_read+0x46/0x60 [xfs]
Mar 12 23:05:16 static1 kernel: [a02063c7] ?
xfs_trans_read_buf+0x197/0x410 [xfs]
Mar 12 23:05:16 static1 kernel: [8152a61d]
wait_for_completion+0x1d/0x20
Mar 12 23:05:16 static1 kernel: [a020ff5b]
xfs_buf_iowait+0x9b/0x100 [xfs]
Mar 12 23:05:16 static1 kernel: [a02063c7] ?
xfs_trans_read_buf+0x197/0x410 [xfs]
Mar 12 23:05:16 static1 kernel: [a0210a76]
_xfs_buf_read+0x46/0x60 [xfs]
Mar 12 23:05:16 static1 kernel: [a0210b3b]
xfs_buf_read+0xab/0x100 [xfs]
Mar 12 23:05:16 static1 kernel: [a02063c7]
xfs_trans_read_buf+0x197/0x410 [xfs]
Mar 12 23:05:16 static1 kernel: [a01ee6a4]
xfs_imap_to_bp+0x54/0x130 [xfs]
Mar 12 23:05:16 static1 kernel: [a01f077b]
xfs_iread+0x7b/0x1b0 [xfs]
Mar 12 23:05:16 static1 kernel: [811ab77e] ?
inode_init_always+0x11e/0x1c0
Mar 12 23:05:16 static1 kernel: [a01eb5ee]
xfs_iget+0x27e/0x6e0 [xfs]
Mar 12 23:05:16 static1 kernel: [a01eae1d] ?
xfs_iunlock+0x5d/0xd0 [xfs]
Mar 12 23:05:16 static1 kernel: [a0209366]
xfs_lookup+0xc6/0x110 [xfs]
Mar 12 23:05:16 static1 kernel: [a0216024]
xfs_vn_lookup+0x54/0xa0 [xfs]
Mar 12 23:05:16 static1 kernel: [8119dc65]
do_lookup+0x1a5/0x230
Mar 12 23:05:16 static1 kernel: [8119e8f4]
__link_path_walk+0x7a4/0x1000
Mar 12 23:05:16 static1 kernel: [811738e7] ?
cache_grow+0x217/0x320
Mar 12 23:05:16 static1 kernel: [8119f40a]
path_walk+0x6a/0xe0
Mar 12 23:05:16 static1 kernel: [8119f61b]
filename_lookup+0x6b/0xc0
Mar 12 23:05:16 static1 kernel: [811a0747]
user_path_at+0x57/0xa0
Mar 12 23:05:16 static1 kernel: [a0204e74] ?
_xfs_trans_commit+0x214/0x2a0 [xfs]
Mar 12 23:05:16 static1 kernel: [a01eae3e] ?
xfs_iunlock+0x7e/0xd0 [xfs]
Mar 12 23:05:16 static1 kernel: [81193bc0]
vfs_fstatat+0x50/0xa0
Mar 12 23:05:16 static1 kernel: [811aaf5d] ?
touch_atime+0x14d/0x1a0
Mar 12 23:05:16 static1 kernel: [81193d3b]
vfs_stat+0x1b/0x20
Mar 12 23:05:16 static1 kernel: [81193d64]
sys_newstat+0x24/0x50
Mar 12 23:05:16 static1 kernel: [810e5c87] ?
audit_syscall_entry+0x1d7/0x200
Mar 12 23:05:16 static1 kernel: [810e5a7e] ?
__audit_syscall_exit+0x25e/0x290
Mar 12 23:05:16 static1 kernel: [8100b072]
system_call_fastpath+0x16/0x1b



I am wondering if my volume settings are causing this.  Can anyone 
with more knowledge take a look and let me know:


network.remote-dio: on
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
nfs.export-volumes: on
network.ping-timeout: 20
cluster.self-heal-readdir-size: 64KB
cluster.quorum-type: auto
cluster.data-self-heal-algorithm: diff

[ovirt-users] VMs freezing during heals

2015-03-17 Thread Alastair Neil
I have an oVirt cluster with 6 VM hosts and 4 gluster nodes. There are two
virtualisation clusters, one with two Nehalem nodes and one with four
Sandy Bridge nodes. My master storage domain is a GlusterFS domain backed by a
replica 3 gluster volume from 3 of the gluster nodes.  The engine is a
hosted engine 3.5.1 on 3 of the Sandy Bridge nodes, with storage provided by
NFS from a different gluster volume.  All the hosts are CentOS 6.6.

 vdsm-4.16.10-8.gitc937927.el6
 glusterfs-3.6.2-1.el6
 2.6.32 - 504.8.1.el6.x86_64


Problems happen when I try to add a new brick or replace a brick: eventually
the self-heal will kill the VMs. In the VMs' logs I see kernel hung-task
messages.

Mar 12 23:05:16 static1 kernel: INFO: task nginx:1736 blocked for more than
 120 seconds.
 Mar 12 23:05:16 static1 kernel:  Not tainted 2.6.32-504.3.3.el6.x86_64
 #1
 Mar 12 23:05:16 static1 kernel: echo 0 >
 /proc/sys/kernel/hung_task_timeout_secs disables this message.
 Mar 12 23:05:16 static1 kernel: nginx D 0001 0
  1736   1735 0x0080
 Mar 12 23:05:16 static1 kernel: 8800778b17a8 0082
  000126c0
 Mar 12 23:05:16 static1 kernel: 88007e5c6500 880037170080
 0006ce5c85bd9185 88007e5c64d0
 Mar 12 23:05:16 static1 kernel: 88007a614ae0 0001722b64ba
 88007a615098 8800778b1fd8
 Mar 12 23:05:16 static1 kernel: Call Trace:
 Mar 12 23:05:16 static1 kernel: [8152a885]
 schedule_timeout+0x215/0x2e0
 Mar 12 23:05:16 static1 kernel: [8152a503]
 wait_for_common+0x123/0x180
 Mar 12 23:05:16 static1 kernel: [81064b90] ?
 default_wake_function+0x0/0x20
 Mar 12 23:05:16 static1 kernel: [a0210a76] ?
 _xfs_buf_read+0x46/0x60 [xfs]
 Mar 12 23:05:16 static1 kernel: [a02063c7] ?
 xfs_trans_read_buf+0x197/0x410 [xfs]
 Mar 12 23:05:16 static1 kernel: [8152a61d]
 wait_for_completion+0x1d/0x20
 Mar 12 23:05:16 static1 kernel: [a020ff5b]
 xfs_buf_iowait+0x9b/0x100 [xfs]
 Mar 12 23:05:16 static1 kernel: [a02063c7] ?
 xfs_trans_read_buf+0x197/0x410 [xfs]
 Mar 12 23:05:16 static1 kernel: [a0210a76]
 _xfs_buf_read+0x46/0x60 [xfs]
 Mar 12 23:05:16 static1 kernel: [a0210b3b]
 xfs_buf_read+0xab/0x100 [xfs]
 Mar 12 23:05:16 static1 kernel: [a02063c7]
 xfs_trans_read_buf+0x197/0x410 [xfs]
 Mar 12 23:05:16 static1 kernel: [a01ee6a4]
 xfs_imap_to_bp+0x54/0x130 [xfs]
 Mar 12 23:05:16 static1 kernel: [a01f077b] xfs_iread+0x7b/0x1b0
 [xfs]
 Mar 12 23:05:16 static1 kernel: [811ab77e] ?
 inode_init_always+0x11e/0x1c0
 Mar 12 23:05:16 static1 kernel: [a01eb5ee] xfs_iget+0x27e/0x6e0
 [xfs]
 Mar 12 23:05:16 static1 kernel: [a01eae1d] ?
 xfs_iunlock+0x5d/0xd0 [xfs]
 Mar 12 23:05:16 static1 kernel: [a0209366] xfs_lookup+0xc6/0x110
 [xfs]
 Mar 12 23:05:16 static1 kernel: [a0216024]
 xfs_vn_lookup+0x54/0xa0 [xfs]
 Mar 12 23:05:16 static1 kernel: [8119dc65] do_lookup+0x1a5/0x230
 Mar 12 23:05:16 static1 kernel: [8119e8f4]
 __link_path_walk+0x7a4/0x1000
 Mar 12 23:05:16 static1 kernel: [811738e7] ?
 cache_grow+0x217/0x320
 Mar 12 23:05:16 static1 kernel: [8119f40a] path_walk+0x6a/0xe0
 Mar 12 23:05:16 static1 kernel: [8119f61b]
 filename_lookup+0x6b/0xc0
 Mar 12 23:05:16 static1 kernel: [811a0747] user_path_at+0x57/0xa0
 Mar 12 23:05:16 static1 kernel: [a0204e74] ?
 _xfs_trans_commit+0x214/0x2a0 [xfs]
 Mar 12 23:05:16 static1 kernel: [a01eae3e] ?
 xfs_iunlock+0x7e/0xd0 [xfs]
 Mar 12 23:05:16 static1 kernel: [81193bc0] vfs_fstatat+0x50/0xa0
 Mar 12 23:05:16 static1 kernel: [811aaf5d] ?
 touch_atime+0x14d/0x1a0
 Mar 12 23:05:16 static1 kernel: [81193d3b] vfs_stat+0x1b/0x20
 Mar 12 23:05:16 static1 kernel: [81193d64] sys_newstat+0x24/0x50
 Mar 12 23:05:16 static1 kernel: [810e5c87] ?
 audit_syscall_entry+0x1d7/0x200
 Mar 12 23:05:16 static1 kernel: [810e5a7e] ?
 __audit_syscall_exit+0x25e/0x290
 Mar 12 23:05:16 static1 kernel: [8100b072]
 system_call_fastpath+0x16/0x1b



I am wondering if my volume settings are causing this.  Can anyone with
more knowledge take a look and let me know:

network.remote-dio: on
 performance.stat-prefetch: off
 performance.io-cache: off
 performance.read-ahead: off
 performance.quick-read: off
 nfs.export-volumes: on
 network.ping-timeout: 20
 cluster.self-heal-readdir-size: 64KB
 cluster.quorum-type: auto
 cluster.data-self-heal-algorithm: diff
 cluster.self-heal-window-size: 8
 cluster.heal-timeout: 500
 cluster.self-heal-daemon: on
 cluster.entry-self-heal: on
 cluster.data-self-heal: on
 cluster.metadata-self-heal: on
 cluster.readdir-optimize: on
 cluster.background-self-heal-count: 20
 cluster.rebalance-stats: on
 cluster.min-free-disk: 5%
 cluster.eager-lock: enable
 storage.owner-uid: 36
 storage.owner-gid: 36
 auth.allow:*
 user.cifs: disable
 cluster.server-quorum-ratio: 51%
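
In case it helps: these can all be changed on the fly with gluster volume set,
and the heal backlog can be watched while a brick is added. The volume name
below is a placeholder and the values are only examples of things I could
experiment with, not settings I know to be correct:

 # show the current options for the volume
 gluster volume info myvol

 # e.g. try a smaller background heal queue and heal window while VMs run
 gluster volume set myvol cluster.background-self-heal-count 8
 gluster volume set myvol cluster.self-heal-window-size 2

 # see how many entries are still pending heal
 gluster volume heal myvol info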


Many Thanks,  Alastair