Re: [ovirt-users] VMs freezing during heals
I hadn’t revisited it yet, but it is possible to use cgroups to limit glusterfsd’s CPU usage; it might help you out. Andrew Klau has a blog post about it: http://www.andrewklau.com/controlling-glusterfsd-cpu-outbreaks-with-cgroups/

Careful about how far you throttle it down: if it’s your VMs’ disk it’s rebuilding, you’ll pause them anyway, I’d expect.
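For anyone wanting to try this, a minimal sketch of the cgroup approach (assuming cgroup v1 with the cpu controller mounted under /sys/fs/cgroup/cpu; on CentOS 6 it is typically /cgroup/cpu via the cgconfig service, and the group name and the ~40% quota below are arbitrary examples, not values taken from that post):

    # create a cpu cgroup and cap it at roughly 40% of one core
    mkdir /sys/fs/cgroup/cpu/glusterfsd_throttle
    echo 100000 > /sys/fs/cgroup/cpu/glusterfsd_throttle/cpu.cfs_period_us
    echo 40000  > /sys/fs/cgroup/cpu/glusterfsd_throttle/cpu.cfs_quota_us

    # move the running brick daemons into the group
    for pid in $(pidof glusterfsd); do
        echo "$pid" > /sys/fs/cgroup/cpu/glusterfsd_throttle/tasks
    done

Note that brick processes started later (for example after a volume stop/start) will not be in the group unless the classification step is repeated, and throttling glusterfsd also slows the heal itself.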
Re: [ovirt-users] VMs freezing during heals
On 04/03/2015 10:04 PM, Alastair Neil wrote:
Any follow up on this? Are there known issues using a replica 3 gluster datastore with LVM thin provisioned bricks?
On 20 March 2015 at 15:22, Alastair Neil ajneil.t...@gmail.com wrote:
CentOS 6.6, vdsm-4.16.10-8.gitc937927.el6, glusterfs-3.6.2-1.el6, kernel 2.6.32-504.8.1.el6.x86_64. Moved to 3.6 specifically to get the snapshotting feature, hence my desire to migrate to thinly provisioned LVM bricks.

Well, on the glusterfs mailing list there have been discussions: 3.6.2 is a major release and introduces some new cluster-wide features. Additionally, it is not stable yet.
Re: [ovirt-users] VMs freezing during heals
Any follow up on this? Are there known issues using a replica 3 gluster datastore with LVM thin provisioned bricks?
Re: [ovirt-users] VMs freezing during heals
Pranith, I have run a pretty straightforward test. I created a two-brick 50 GB replica volume with normal LVM bricks, and installed two servers, one CentOS 6.6 and one CentOS 7.0. I kicked off bonnie++ on both to generate some file system activity and then made the volume replica 3. I saw no issues on the servers. It is not clear this is a sufficiently rigorous test; the volume I have had issues on is a 3 TB volume with about 2 TB used.

-Alastair

On 19 March 2015 at 12:30, Alastair Neil ajneil.t...@gmail.com wrote:
I don't think I have the resources to test it meaningfully. I have about 50 VMs on my primary storage domain. I might be able to set up a small 50 GB volume and provision 2 or 3 VMs running test loads, but I'm not sure it would be comparable. I'll give it a try and let you know if I see similar behaviour.

On 19 March 2015 at 11:34, Pranith Kumar Karampuri pkara...@redhat.com wrote:
Without thinly provisioned lvm.
Pranith

On 03/19/2015 08:01 PM, Alastair Neil wrote:
Do you mean raw partitions as bricks, or simply without thin provisioned lvm?

On 19 March 2015 at 00:32, Pranith Kumar Karampuri pkara...@redhat.com wrote:
Could you let me know if you see this problem without lvm as well?
Pranith

On 03/18/2015 08:25 PM, Alastair Neil wrote:
I am in the process of replacing the bricks with thinly provisioned LVs, yes.
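A sketch of the gluster CLI steps that test corresponds to (the hostnames and brick paths here are placeholders, not the actual test setup):

    # create a two-brick replica 2 volume on plain LVM-backed bricks
    gluster volume create testvol replica 2 \
        server1:/bricks/testvol/brick server2:/bricks/testvol/brick
    gluster volume start testvol

    # while client I/O (e.g. bonnie++) is running, grow it to replica 3;
    # this triggers a full self-heal onto the new brick
    gluster volume add-brick testvol replica 3 server3:/bricks/testvol/brick
    gluster volume heal testvol info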
Re: [ovirt-users] VMs freezing during heals
CentOS 6.6, vdsm-4.16.10-8.gitc937927.el6, glusterfs-3.6.2-1.el6, kernel 2.6.32-504.8.1.el6.x86_64. Moved to 3.6 specifically to get the snapshotting feature, hence my desire to migrate to thinly provisioned LVM bricks.

On 20 March 2015 at 14:57, Darrell Budic bu...@onholyground.com wrote:
What version of gluster are you running on these? I’ve seen high load during heals bounce my hosted engine around due to overall system load, but never pause anything else. CentOS 7 combo storage/host systems, gluster 3.5.2.
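For context, gluster volume snapshots require bricks on thin-provisioned LVs, so the migration involves something along these lines (device names, sizes, and mount points below are illustrative only):

    # carve a thin pool out of a dedicated VG and create a thin LV for the brick
    pvcreate /dev/sdb
    vgcreate vg_bricks /dev/sdb
    lvcreate -L 900G --thinpool tp_brick1 vg_bricks
    lvcreate -V 850G --thin -n lv_brick1 vg_bricks/tp_brick1

    # format and mount it as the new brick
    mkfs.xfs -i size=512 /dev/vg_bricks/lv_brick1
    mkdir -p /bricks/brick1
    mount /dev/vg_bricks/lv_brick1 /bricks/brick1

The old brick is then swapped for the new one (for example with gluster volume replace-brick ... commit force), which is what kicks off the heals being discussed here.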
Re: [ovirt-users] VMs freezing during heals
hi,
Are you using a thin-lvm based backend on which the bricks are created?
Pranith
[ovirt-users] VMs freezing during heals
I have an oVirt cluster with 6 VM hosts and 4 gluster nodes. There are two virtualisation clusters: one with two Nehalem nodes and one with four Sandy Bridge nodes. My master storage domain is a GlusterFS domain backed by a replica 3 gluster volume built from 3 of the gluster nodes. The engine is a hosted engine 3.5.1 on 3 of the Sandy Bridge nodes, with storage provided by NFS from a different gluster volume.

All the hosts are CentOS 6.6, with:
vdsm-4.16.10-8.gitc937927.el6
glusterfs-3.6.2-1.el6
kernel 2.6.32-504.8.1.el6.x86_64

Problems happen when I try to add a new brick or replace a brick: eventually the self-heal will kill the VMs. In the VMs' logs I see kernel hung task messages:

Mar 12 23:05:16 static1 kernel: INFO: task nginx:1736 blocked for more than 120 seconds.
Mar 12 23:05:16 static1 kernel: Not tainted 2.6.32-504.3.3.el6.x86_64 #1
Mar 12 23:05:16 static1 kernel: echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
Mar 12 23:05:16 static1 kernel: nginx D 0001 0 1736 1735 0x0080
Mar 12 23:05:16 static1 kernel: 8800778b17a8 0082 000126c0
Mar 12 23:05:16 static1 kernel: 88007e5c6500 880037170080 0006ce5c85bd9185 88007e5c64d0
Mar 12 23:05:16 static1 kernel: 88007a614ae0 0001722b64ba 88007a615098 8800778b1fd8
Mar 12 23:05:16 static1 kernel: Call Trace:
Mar 12 23:05:16 static1 kernel: [8152a885] schedule_timeout+0x215/0x2e0
Mar 12 23:05:16 static1 kernel: [8152a503] wait_for_common+0x123/0x180
Mar 12 23:05:16 static1 kernel: [81064b90] ? default_wake_function+0x0/0x20
Mar 12 23:05:16 static1 kernel: [a0210a76] ? _xfs_buf_read+0x46/0x60 [xfs]
Mar 12 23:05:16 static1 kernel: [a02063c7] ? xfs_trans_read_buf+0x197/0x410 [xfs]
Mar 12 23:05:16 static1 kernel: [8152a61d] wait_for_completion+0x1d/0x20
Mar 12 23:05:16 static1 kernel: [a020ff5b] xfs_buf_iowait+0x9b/0x100 [xfs]
Mar 12 23:05:16 static1 kernel: [a02063c7] ? xfs_trans_read_buf+0x197/0x410 [xfs]
Mar 12 23:05:16 static1 kernel: [a0210a76] _xfs_buf_read+0x46/0x60 [xfs]
Mar 12 23:05:16 static1 kernel: [a0210b3b] xfs_buf_read+0xab/0x100 [xfs]
Mar 12 23:05:16 static1 kernel: [a02063c7] xfs_trans_read_buf+0x197/0x410 [xfs]
Mar 12 23:05:16 static1 kernel: [a01ee6a4] xfs_imap_to_bp+0x54/0x130 [xfs]
Mar 12 23:05:16 static1 kernel: [a01f077b] xfs_iread+0x7b/0x1b0 [xfs]
Mar 12 23:05:16 static1 kernel: [811ab77e] ? inode_init_always+0x11e/0x1c0
Mar 12 23:05:16 static1 kernel: [a01eb5ee] xfs_iget+0x27e/0x6e0 [xfs]
Mar 12 23:05:16 static1 kernel: [a01eae1d] ? xfs_iunlock+0x5d/0xd0 [xfs]
Mar 12 23:05:16 static1 kernel: [a0209366] xfs_lookup+0xc6/0x110 [xfs]
Mar 12 23:05:16 static1 kernel: [a0216024] xfs_vn_lookup+0x54/0xa0 [xfs]
Mar 12 23:05:16 static1 kernel: [8119dc65] do_lookup+0x1a5/0x230
Mar 12 23:05:16 static1 kernel: [8119e8f4] __link_path_walk+0x7a4/0x1000
Mar 12 23:05:16 static1 kernel: [811738e7] ? cache_grow+0x217/0x320
Mar 12 23:05:16 static1 kernel: [8119f40a] path_walk+0x6a/0xe0
Mar 12 23:05:16 static1 kernel: [8119f61b] filename_lookup+0x6b/0xc0
Mar 12 23:05:16 static1 kernel: [811a0747] user_path_at+0x57/0xa0
Mar 12 23:05:16 static1 kernel: [a0204e74] ? _xfs_trans_commit+0x214/0x2a0 [xfs]
Mar 12 23:05:16 static1 kernel: [a01eae3e] ? xfs_iunlock+0x7e/0xd0 [xfs]
Mar 12 23:05:16 static1 kernel: [81193bc0] vfs_fstatat+0x50/0xa0
Mar 12 23:05:16 static1 kernel: [811aaf5d] ? touch_atime+0x14d/0x1a0
Mar 12 23:05:16 static1 kernel: [81193d3b] vfs_stat+0x1b/0x20
Mar 12 23:05:16 static1 kernel: [81193d64] sys_newstat+0x24/0x50
Mar 12 23:05:16 static1 kernel: [810e5c87] ? audit_syscall_entry+0x1d7/0x200
Mar 12 23:05:16 static1 kernel: [810e5a7e] ? __audit_syscall_exit+0x25e/0x290
Mar 12 23:05:16 static1 kernel: [8100b072] system_call_fastpath+0x16/0x1b

I am wondering if my volume settings are causing this. Can anyone with more knowledge take a look and let me know:

network.remote-dio: on
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
nfs.export-volumes: on
network.ping-timeout: 20
cluster.self-heal-readdir-size: 64KB
cluster.quorum-type: auto
cluster.data-self-heal-algorithm: diff
cluster.self-heal-window-size: 8
cluster.heal-timeout: 500
cluster.self-heal-daemon: on
cluster.entry-self-heal: on
cluster.data-self-heal: on
cluster.metadata-self-heal: on
cluster.readdir-optimize: on
cluster.background-self-heal-count: 20
cluster.rebalance-stats: on
cluster.min-free-disk: 5%
cluster.eager-lock: enable
storage.owner-uid: 36
storage.owner-gid: 36
auth.allow: *
user.cifs: disable
cluster.server-quorum-ratio: 51%

Many Thanks, Alastair
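For reference, the self-heal options in that list are set per volume with gluster volume set; a sketch of how one might dial back heal aggressiveness while investigating (the volume name and the specific values are illustrative only, not recommendations from this thread):

    # make self-heal less aggressive on an existing volume
    gluster volume set myvol cluster.background-self-heal-count 8
    gluster volume set myvol cluster.self-heal-window-size 2
    gluster volume set myvol cluster.data-self-heal-algorithm diff

    # confirm the reconfigured options
    gluster volume info myvol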