Re: [ovirt-users] VM has been paused due to storage I/O problem
On Thu, Feb 2, 2017 at 10:48 AM, Nir Soffer wrote:
>
> > After about 7 rounds I get this in messages of the host where the VM is
> > running:
> >
> > Feb 1 23:31:39 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
> > Feb 1 23:31:39 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
> > Feb 1 23:31:44 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
> > Feb 1 23:31:45 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
> > Feb 1 23:31:45 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
> > Feb 1 23:31:47 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
> > Feb 1 23:31:50 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
> > Feb 1 23:31:50 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
> > Feb 1 23:31:56 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
> > Feb 1 23:31:57 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
>
> This is interesting, we have seen these messages before, but could never
> detect the flow causing them. Are you sure you see this each time you
> extend your disk?
>
> If you can reproduce this, please file a bug.

I got the same on 4.1.1 while testing on iSCSI, so I opened a bugzilla now...
let's see if we can find anything useful
https://bugzilla.redhat.com/show_bug.cgi?id=1438809

Gianluca
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users
Re: [ovirt-users] VM has been paused due to storage I/O problem
On Fri, Feb 03, 2017 at 12:31:49AM +0100, Gianluca Cecchi wrote:
>On Thu, Feb 2, 2017 at 10:53 PM, Benjamin Marzinski
><[1]bmarz...@redhat.com> wrote:
>
> > > > I'm trying to mitigate inserting a timeout for my SAN devices but
> > > > I'm not sure of its effectiveness, as CentOS 7 behavior of
> > > > "multipathd -k" and then "show config" seems different from CentOS 6.x.
> > > > In fact my attempt for multipath.conf is this
> >
> > There was a significant change in how multipath deals with merging
> > device configurations between RHEL6 and RHEL7. The short answer is: as
> > long as you copy the entire existing configuration and just change what
> > you want changed (like you did), you can ignore the change. Also,
> > multipath doesn't care if you quote numbers.
> >
> > If you want to verify that no_path_retry is being set as intended, you
> > can run:
> >
> > # multipath -r -v3 | grep no_path_retry
>
>Hi Benjamin,
>thank you very much for the explanations, especially the long one ;-)
>I tried and confirmed that I have no_path_retry = 4 as expected.
>The regex matching is only for merge, correct?

No. Both RHEL6 and RHEL7 use regex matching to determine which device
configuration to use with your device; otherwise product "^1814" would never
match any device, since there is no array with a literal product string of
"^1814". RHEL7 also uses the same regex matching to determine which builtin
device configuration a user-supplied device configuration should modify.
RHEL6 uses string matching for this.

>So in your example if in RHEL 7 I put this
> device {
> vendor "IBM"
> product "^1814"
> no_path_retry 12
> }
>It would not match for merging, but it would match for applying to my
>device (because it is put at the end of the config, read backwards).

Correct. The confusing point is that in the merging case, "^1814" in the
user-supplied configuration is being treated as a string that needs to regex
match the regular expression "^1814" in the builtin configuration. These
don't match.
For matching the device configuration to the device, "^1814" in the
user-supplied configuration is being treated as a regular expression that
needs to regex match the actual product string of the device.

>And it would apply only the no_path_retry setting, while all the other ones
>would not be picked from the builtin configuration for the device, but from
>the general defaults.
>So for example it would set path_checker not this way:
>path_checker "rdac"
>but this way:
>path_checker "directio"
>which is the default...
>correct?

Exactly.

-Ben

>References
>
>Visible links
>1. mailto:bmarz...@redhat.com
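Ben's merge rule can be sketched with a quick regex check (an illustration only, not multipath's actual code; the "1814 FAStT" product string below is a hypothetical example of a real array's literal product string):

```python
import re

builtin_product = "^1814"   # regex stored in the builtin device config

# RHEL7 merge check: the builtin regex must match the user-supplied
# product string. A verbatim copy of the regex itself does not match...
print(bool(re.match(builtin_product, "^1814")))       # False
# ...but a literal product string starting with "1814" does.
print(bool(re.match(builtin_product, "1814 FAStT")))  # True
```

This is why copying the builtin's regex verbatim prevents merging on RHEL7, unless "hw_str_match yes" restores the RHEL6 string comparison.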
Re: [ovirt-users] VM has been paused due to storage I/O problem
On Thu, Feb 2, 2017 at 10:53 PM, Benjamin Marzinski wrote:
>
> > > I'm trying to mitigate inserting a timeout for my SAN devices but I'm
> > > not sure of its effectiveness, as CentOS 7 behavior of "multipathd -k"
> > > and then "show config" seems different from CentOS 6.x.
> > > In fact my attempt for multipath.conf is this
>
> There was a significant change in how multipath deals with merging
> device configurations between RHEL6 and RHEL7. The short answer is: as
> long as you copy the entire existing configuration and just change what
> you want changed (like you did), you can ignore the change. Also,
> multipath doesn't care if you quote numbers.
>
> If you want to verify that no_path_retry is being set as intended, you
> can run:
>
> # multipath -r -v3 | grep no_path_retry

Hi Benjamin,
thank you very much for the explanations, especially the long one ;-)
I tried and confirmed that I have no_path_retry = 4 as expected.
The regex matching is only for merge, correct?
So in your example if in RHEL 7 I put this

device {
    vendor "IBM"
    product "^1814"
    no_path_retry 12
}

It would not match for merging, but it would match for applying to my device
(because it is put at the end of the config, read backwards).
And it would apply only the no_path_retry setting, while all the other ones
would not be picked from the builtin configuration for the device, but from
the general defaults.
So for example it would set path_checker not this way:
path_checker "rdac"
but this way:
path_checker "directio"
which is the default...
correct?
Re: [ovirt-users] VM has been paused due to storage I/O problem
On Wed, Feb 01, 2017 at 09:39:45AM +0200, Nir Soffer wrote:
> On Tue, Jan 31, 2017 at 6:09 PM, Gianluca Cecchi wrote:
> > On Tue, Jan 31, 2017 at 3:23 PM, Nathanaël Blanchet wrote:
> >>
> >> exactly the same issue by there with FC EMC domain storage...
> >
> > I'm trying to mitigate inserting a timeout for my SAN devices but I'm not
> > sure of its effectiveness, as CentOS 7 behavior of "multipathd -k" and
> > then "show config" seems different from CentOS 6.x.
> > In fact my attempt for multipath.conf is this

There was a significant change in how multipath deals with merging device
configurations between RHEL6 and RHEL7. The short answer is: as long as you
copy the entire existing configuration and just change what you want changed
(like you did), you can ignore the change. Also, multipath doesn't care if
you quote numbers.

If you want to verify that no_path_retry is being set as intended, you can
run:

# multipath -r -v3 | grep no_path_retry

to reload your multipath devices with verbosity turned up. You should see
lines like:

Feb 02 09:38:30 | mpatha: no_path_retry = 12 (controller setting)

That will tell you what no_path_retry is set to. The configuration Nir
suggested at the end of this email looks good to me.

Now, here's the long answer: multipath allows you to merge device
configurations. This means that as long as you put in the "vendor" and
"product" strings, you only need to set the other values that you care
about. On RHEL6, this would work:

device {
    vendor "IBM"
    product "^1814"
    no_path_retry 12
}

And it would create a configuration that was exactly the same as the builtin
config for this device, except that no_path_retry was set to 12. However,
this wasn't as easy for users as it was supposed to be. Specifically, users
would often add their device's vendor and product information, as well as
whatever they wanted changed, and then be surprised when multipath didn't
retain all the information from the builtin configuration as advertised.
This is because they used the actual vendor and product strings for their
device, but the builtin device configuration's vendor and product strings
were regexes. In RHEL6, multipath only merged configurations if the vendor
and product strings string matched. So users would try:

device {
    vendor "IBM"
    product "1814 FASt"
    no_path_retry 12
}

and it wouldn't work as expected, since the product strings didn't match.

To fix this, when RHEL7 checks if a user configuration should be merged with
a builtin configuration, all that is required is that the user
configuration's vendor and product strings regex match the builtin. This
means that the above configuration will work as expected in RHEL7. However,
the first configuration won't, because "^1814" doesn't regex match "^1814".
This means that multipath would treat it as a completely new configuration,
and not merge any values from the builtin configuration. You can re-enable
the RHEL6 behaviour in RHEL7 by setting

hw_str_match yes

in the defaults section.

Now, because the builtin configurations could handle more than one device
type per configuration (since they used regexes to match the vendor and
product strings), multipath couldn't just remove the original builtin
configuration when users added a new configuration that modified it.
Otherwise, devices that regex matched the builtin configuration's vendor and
product strings but not the user configuration's vendor and product strings
wouldn't have any device configuration information. So multipath keeps the
original builtin configuration as well as the new one. However, when it's
time to assign a device configuration to a device, multipath looks through
the device configuration list backwards and finds the first match. This
means that it will always use the user configuration instead of the builtin
one (since new configurations get added to the end of the list).
Like I said before, if you add all the values you want set in your
configuration, instead of relying on them being merged from the builtin
configuration, then you don't need to worry about any of this.

-Ben

> > # VDSM REVISION 1.3
> > # VDSM PRIVATE
> >
> > defaults {
> >     polling_interval 5
> >     no_path_retry fail
> >     user_friendly_names no
> >     flush_on_last_del yes
> >     fast_io_fail_tmo 5
> >     dev_loss_tmo 30
> >     max_fds 4096
> > }
> >
> > # Remove devices entries when overrides section is available.
> > devices {
> >     device {
> >         # These settings override built-in device settings. It does not
> >         # apply to devices without built-in settings (these use the
> >         # settings in the "defaults" section), or to devices defined in
> >         # the "devices" section.
> >         # Note: This is not available yet
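Ben's "add all the values you want set" advice looks roughly like this in a multipath.conf devices section. The values below are illustrative for an IBM DS4000-family array, not an exact copy of any shipped builtin; dump the real builtin with "multipathd show config" on your host first and copy that:

```
devices {
    device {
        # copied verbatim from the builtin config for this array...
        vendor "IBM"
        product "^1814"
        path_grouping_policy "group_by_prio"
        path_checker "rdac"
        prio "rdac"
        failback "immediate"
        # ...except for the one value actually being changed:
        no_path_retry 12
    }
}
```

With the full copy in place, the RHEL6-vs-RHEL7 merge semantics no longer matter, since nothing needs to be merged from the builtin.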
Re: [ovirt-users] VM has been paused due to storage I/O problem
On Thu, Feb 2, 2017 at 6:05 PM, Gianluca Cecchi wrote:
> On Thu, Feb 2, 2017 at 3:51 PM, Nir Soffer wrote:
>>
>> > Can you confirm that the host can be active when I restart vdsmd
>> > service?
>>
>> Sure. This may abort a storage operation if one is running when you
>> restart vdsm, but vdsm is designed so you can restart or kill it safely.
>>
>> For example, if you abort a disk copy in the middle, the operation will
>> fail and the destination disk will be deleted.
>>
>> If you want to avoid such an issue, you can put a host into maintenance,
>> but this requires migration of vms to other hosts.
>>
>> Nir
>
> OK. Created 50_thin_block_extension_rules.conf under /etc/vdsm/vdsm.conf.d
> and restarted vdsmd
>
> One last (latest probably... ;-) question
> Is it expected that if I restart vdsmd on the host that is the SPM, then
> SPM is shifted to another node?

Yes, engine will move the spm to another host when the spm fails, unless you
disabled the spm role for any other host (see host > spm tab).

> Because when restarting vdsmd on the host that is not SPM I didn't get any
> message in the web admin gui and the restart of vdsmd itself was very fast.
> Instead on the host with SPM, the command took several seconds and I got
> these events

It is expected that restarting the spm is slower, but we need to see vdsm
logs to understand why.

> Feb 2, 2017 4:01:23 PM Host ovmsrv05 power management was verified
> successfully.
> Feb 2, 2017 4:01:23 PM Status of host ovmsrv05 was set to Up.
> Feb 2, 2017 4:01:19 PM Executing power management status on Host ovmsrv05
> using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
> Feb 2, 2017 4:01:18 PM Storage Pool Manager runs on Host ovmsrv06
> (Address: ovmsrv06.datacenter.polimi.it).
> Feb 2, 2017 4:01:13 PM VDSM ovmsrv05 command failed: Recovering from crash
> or Initializing
> Feb 2, 2017 4:01:11 PM Host ovmsrv05 is initializing. Message: Recovering
> from crash or Initializing
> Feb 2, 2017 4:01:11 PM VDSM ovmsrv05 command failed: Recovering from crash
> or Initializing
> Feb 2, 2017 4:01:11 PM Invalid status on Data Center Default. Setting Data
> Center status to Non Responsive (On host ovmsrv05, Error: Recovering from
> crash or Initializing).
> Feb 2, 2017 4:01:11 PM VDSM ovmsrv05 command failed: Recovering from crash
> or Initializing
> Feb 2, 2017 4:01:05 PM Host ovmsrv05 is not responding. It will stay in
> Connecting state for a grace period of 80 seconds and after that an
> attempt to fence the host will be issued.
> Feb 2, 2017 4:01:05 PM Host ovmsrv05 is not responding. It will stay in
> Connecting state for a grace period of 80 seconds and after that an
> attempt to fence the host will be issued.
> Feb 2, 2017 4:01:05 PM VDSM ovmsrv05 command failed: Connection reset by
> peer

It looks like the engine discovered that the SPM was down, and reconnected.
It is expected that changes in the spm status are detected early and that
engine tries to recover the spm; the SPM role is critical in ovirt.

Are you sure you did not get any message when restarting the other host?
I would expect engine to detect and report a restart of any host.

If you can reproduce this - a vdsm restart not detected by engine and not
reported in the engine event log - please file a bug.

Nir
Re: [ovirt-users] VM has been paused due to storage I/O problem
On Thu, Feb 2, 2017 at 3:51 PM, Nir Soffer wrote:
>
> > Can you confirm that the host can be active when I restart vdsmd service?
>
> Sure. This may abort a storage operation if one is running when you
> restart vdsm, but vdsm is designed so you can restart or kill it safely.
>
> For example, if you abort a disk copy in the middle, the operation will
> fail and the destination disk will be deleted.
>
> If you want to avoid such an issue, you can put a host into maintenance,
> but this requires migration of vms to other hosts.
>
> Nir

OK. Created 50_thin_block_extension_rules.conf under /etc/vdsm/vdsm.conf.d
and restarted vdsmd

One last (latest probably... ;-) question
Is it expected that if I restart vdsmd on the host that is the SPM, then SPM
is shifted to another node?
Because when restarting vdsmd on the host that is not SPM I didn't get any
message in the web admin gui and the restart of vdsmd itself was very fast.
Instead on the host with SPM, the command took several seconds and I got
these events:

Feb 2, 2017 4:01:23 PM Host ovmsrv05 power management was verified
successfully.
Feb 2, 2017 4:01:23 PM Status of host ovmsrv05 was set to Up.
Feb 2, 2017 4:01:19 PM Executing power management status on Host ovmsrv05
using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Feb 2, 2017 4:01:18 PM Storage Pool Manager runs on Host ovmsrv06 (Address:
ovmsrv06.datacenter.polimi.it).
Feb 2, 2017 4:01:13 PM VDSM ovmsrv05 command failed: Recovering from crash
or Initializing
Feb 2, 2017 4:01:11 PM Host ovmsrv05 is initializing. Message: Recovering
from crash or Initializing
Feb 2, 2017 4:01:11 PM VDSM ovmsrv05 command failed: Recovering from crash
or Initializing
Feb 2, 2017 4:01:11 PM Invalid status on Data Center Default. Setting Data
Center status to Non Responsive (On host ovmsrv05, Error: Recovering from
crash or Initializing).
Feb 2, 2017 4:01:11 PM VDSM ovmsrv05 command failed: Recovering from crash
or Initializing
Feb 2, 2017 4:01:05 PM Host ovmsrv05 is not responding. It will stay in
Connecting state for a grace period of 80 seconds and after that an attempt
to fence the host will be issued.
Feb 2, 2017 4:01:05 PM Host ovmsrv05 is not responding. It will stay in
Connecting state for a grace period of 80 seconds and after that an attempt
to fence the host will be issued.
Feb 2, 2017 4:01:05 PM VDSM ovmsrv05 command failed: Connection reset by
peer

Thanks
Gianluca
Re: [ovirt-users] VM has been paused due to storage I/O problem
On Thu, Feb 2, 2017 at 4:42 PM, Gianluca Cecchi wrote:
> On Thu, Feb 2, 2017 at 3:30 PM, Nir Soffer wrote:
>>
>> If you don't have a vdsm.conf file, or the file is empty, you can
>> generate a new file like this:
>>
>> python /usr/lib64/python2.7/site-packages/vdsm/config.py > vdsm.conf.example
>
> thanks.
> It seems that the package python-libs-2.7.5-48.el7.x86_64 actually uses
> the /usr/lib path instead of /usr/lib64...
> What worked:
> python /usr/lib/python2.7/site-packages/vdsm/config.py > vdsm.conf.example

Indeed, my mistake, we moved to lib several versions ago.

>> >> > What I should do after changing them to make them active?
>> >>
>> >> Restart vdsm
>> >
>> > with host in maintenance mode, or can it be active?
>> >
>> > Thanks
>
> Can you confirm that the host can be active when I restart vdsmd service?

Sure. This may abort a storage operation if one is running when you restart
vdsm, but vdsm is designed so you can restart or kill it safely.

For example, if you abort a disk copy in the middle, the operation will fail
and the destination disk will be deleted.

If you want to avoid such an issue, you can put a host into maintenance, but
this requires migration of vms to other hosts.

Nir
Re: [ovirt-users] VM has been paused due to storage I/O problem
On Thu, Feb 2, 2017 at 3:30 PM, Nir Soffer wrote:
>
> If you don't have a vdsm.conf file, or the file is empty, you can
> generate a new file like this:
>
> python /usr/lib64/python2.7/site-packages/vdsm/config.py > vdsm.conf.example

thanks.
It seems that the package python-libs-2.7.5-48.el7.x86_64 actually uses the
/usr/lib path instead of /usr/lib64...
What worked:
python /usr/lib/python2.7/site-packages/vdsm/config.py > vdsm.conf.example

> >> > What I should do after changing them to make them active?
> >>
> >> Restart vdsm
> >
> > with host in maintenance mode, or can it be active?

Can you confirm that the host can be active when I restart vdsmd service?

Thanks
Gianluca
Re: [ovirt-users] VM has been paused due to storage I/O problem
On Thu, Feb 2, 2017 at 4:16 PM, Gianluca Cecchi wrote:
> On Thu, Feb 2, 2017 at 12:48 PM, Nir Soffer wrote:
>>
>> > What about [irs] values? Where are they located, in vdsm.conf?
>>
>> Yes, but you should not modify them in vdsm.conf.
>>
>> > What are defaults for volume_utilization_percent and
>> > volume_utilization_chunk_mb?
>> > Did they change from 3.6 to 4.0 to 4.1?
>>
>> No, the defaults did not change in the last 3.5 years.
>
> OK.
> In a previous message of this thread you wrote that the default
> volume_utilization_chunk_mb is 1024
> What about the default for volume_utilization_percent? 50?

Yes, this should be a commented out option in /etc/vdsm/vdsm.conf.

If you don't have a vdsm.conf file, or the file is empty, you can generate a
new file like this:

python /usr/lib64/python2.7/site-packages/vdsm/config.py > vdsm.conf.example

>> > What I should do after changing them to make them active?
>>
>> Restart vdsm
>
> with host in maintenance mode, or can it be active?
>
> Thanks
Re: [ovirt-users] VM has been paused due to storage I/O problem
On Thu, Feb 2, 2017 at 12:48 PM, Nir Soffer wrote:
>
> > What about [irs] values? Where are they located, in vdsm.conf?
>
> Yes, but you should not modify them in vdsm.conf.
>
> > What are defaults for volume_utilization_percent and
> > volume_utilization_chunk_mb?
> > Did they change from 3.6 to 4.0 to 4.1?
>
> No, the defaults did not change in the last 3.5 years.

OK.
In a previous message of this thread you wrote that the default
volume_utilization_chunk_mb is 1024
What about the default for volume_utilization_percent? 50?

> > What I should do after changing them to make them active?
>
> Restart vdsm

with host in maintenance mode, or can it be active?

Thanks
Re: [ovirt-users] VM has been paused due to storage I/O problem
On Thu, Feb 2, 2017 at 1:34 PM, Gianluca Cecchi wrote:
> On Thu, Feb 2, 2017 at 12:09 PM, Yaniv Kaul wrote:
>>
>>> I decided to switch to preallocated for further tests and confirm.
>>> So I created a snapshot and then a clone of the VM, changing the
>>> allocation policy of the disk to preallocated.
>>> So far so good.
>>>
>>> Feb 2, 2017 10:40:23 AM VM ol65preallocated creation has been completed.
>>> Feb 2, 2017 10:24:15 AM VM ol65preallocated creation was initiated by
>>> admin@internal-authz.
>>> Feb 2, 2017 10:22:31 AM Snapshot 'for cloning' creation for VM 'ol65'
>>> has been completed.
>>> Feb 2, 2017 10:22:31 AM Snapshot 'for cloning' creation for VM 'ol65'
>>> was initiated by admin@internal-authz.
>>>
>>> so the throughput seems ok based on this storage type (the LUNs are on
>>> RAID5 made with sata disks): 16 minutes to write 90Gb is about
>>> 96MBytes/s, as expected
>>
>> What is your expectation? Is it FC, iSCSI? How many paths? What is the
>> IO scheduler in the VM? Is it using virtio-blk or virtio-SCSI?
>> Y.
>
> Peak bandwidth no more than 140 MBytes/s, based on storage capabilities,
> but I don't have to do a crude performance test. I need stability.
> Each host has a mezzanine dual-port HBA (4 Gbit); each HBA is connected
> to a different FC-switch and the multipath connection has 2 active paths
> (one for each HBA).
>
> I confirm that with the preallocated disk of the cloned VM I don't have
> the previous problems.
> The same loop executed about 66 times in a 10 minute interval without any
> problem registered on the hosts.
> No message at all in /var/log/messages of either host.
> My storage domain was not compromised.
> The question about thin provisioning and SAN LUNs (aka LVM based disks)
> remains important.
> In my opinion I shouldn't care about the kind of I/O made inside a VM,
> and anyway it shouldn't interfere with my storage domain, bringing down
> my hosts/VMs completely.
> In theory there could be an application inside a VM that generates
> something similar to my loop and so would cause problems.
> For sure I can then notify the VM owner about his/her workload, but it
> should not compromise my virtual infrastructure.
> I could have an RDBMS inside a VM and a user that creates a big datafile,
> and that would imply many extend operations if the disk is thin
> provisioned.
>
> What about [irs] values? Where are they located, in vdsm.conf?

Yes, but you should not modify them in vdsm.conf.

> What are defaults for volume_utilization_percent and
> volume_utilization_chunk_mb?
> Did they change from 3.6 to 4.0 to 4.1?

No, the defaults did not change in the last 3.5 years.

In 4.0 we introduced dropin support, and this is the recommended way to
perform configuration changes.

To change these values, you create a file at:

/etc/vdsm/vdsm.conf.d/50_my.conf

The name of the file does not matter; vdsm will read all files in the
vdsm.conf.d directory, sort them by name (this is why you should use the 50_
prefix), and apply the changes to the configuration.

In this file you put the sections and options you need, like:

[irs]
volume_utilization_percent = 25
volume_utilization_chunk_mb = 2048

> What I should do after changing them to make them active?

Restart vdsm.

Using this method, you can provision the same file on all hosts using
standard provisioning tools.

It is not recommended to modify /etc/vdsm/vdsm.conf. If you do this you will
have to manually merge changes from vdsm.conf.rpmnew after upgrading vdsm.

Nir
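As a rough illustration of how these two knobs interact (a simplified model under the assumption that vdsm extends the volume by chunk_mb whenever free space drops below chunk_mb * (100 - utilization_percent) / 100 -- the exact trigger lives in vdsm's source, so treat the formula as an assumption):

```python
def extend_needed(allocated_mb, used_mb,
                  utilization_percent=50, chunk_mb=1024):
    """Simplified sketch of vdsm's thin-LV extension trigger.

    Assumption (not vdsm's literal code): extend by chunk_mb whenever
    free space falls below chunk_mb * (100 - utilization_percent) / 100.
    """
    free_mb = allocated_mb - used_mb
    threshold_mb = chunk_mb * (100 - utilization_percent) / 100
    return free_mb < threshold_mb

# Defaults (50 / 1024): extend when less than 512 MB is free.
print(extend_needed(2048, 1600))            # True  (448 MB free)
# Suggested 25 / 2048: extend when less than 1536 MB is free.
print(extend_needed(4096, 2600, 25, 2048))  # True  (1496 MB free)
print(extend_needed(4096, 2000, 25, 2048))  # False (2096 MB free)
```

Under this model, lowering volume_utilization_percent and raising volume_utilization_chunk_mb (as suggested above) makes vdsm extend earlier and in bigger steps, which is exactly what a write-heavy guest on a thin LV needs.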
Re: [ovirt-users] VM has been paused due to storage I/O problem
On Thu, Feb 2, 2017 at 12:09 PM, Yaniv Kaul wrote:
>
>> I decided to switch to preallocated for further tests and confirm.
>> So I created a snapshot and then a clone of the VM, changing the
>> allocation policy of the disk to preallocated.
>> So far so good.
>>
>> Feb 2, 2017 10:40:23 AM VM ol65preallocated creation has been completed.
>> Feb 2, 2017 10:24:15 AM VM ol65preallocated creation was initiated by
>> admin@internal-authz.
>> Feb 2, 2017 10:22:31 AM Snapshot 'for cloning' creation for VM 'ol65' has
>> been completed.
>> Feb 2, 2017 10:22:31 AM Snapshot 'for cloning' creation for VM 'ol65' was
>> initiated by admin@internal-authz.
>>
>> so the throughput seems ok based on this storage type (the LUNs are on
>> RAID5 made with sata disks): 16 minutes to write 90Gb is about
>> 96MBytes/s, as expected
>
> What is your expectation? Is it FC, iSCSI? How many paths? What is the IO
> scheduler in the VM? Is it using virtio-blk or virtio-SCSI?
> Y.

Peak bandwidth no more than 140 MBytes/s, based on storage capabilities, but
I don't have to do a crude performance test. I need stability.
Each host has a mezzanine dual-port HBA (4 Gbit); each HBA is connected to a
different FC-switch and the multipath connection has 2 active paths (one for
each HBA).

I confirm that with the preallocated disk of the cloned VM I don't have the
previous problems.
The same loop executed about 66 times in a 10 minute interval without any
problem registered on the hosts.
No message at all in /var/log/messages of either host.
My storage domain was not compromised.
The question about thin provisioning and SAN LUNs (aka LVM based disks)
remains important.
In my opinion I shouldn't care about the kind of I/O made inside a VM, and
anyway it shouldn't interfere with my storage domain, bringing down my
hosts/VMs completely.
In theory there could be an application inside a VM that generates something
similar to my loop and so would cause problems.
For sure I can then notify the VM owner about his/her workload, but it
should not compromise my virtual infrastructure.
I could have an RDBMS inside a VM and a user that creates a big datafile,
and that would imply many extend operations if the disk is thin provisioned.

What about [irs] values? Where are they located, in vdsm.conf?
What are defaults for volume_utilization_percent and
volume_utilization_chunk_mb?
Did they change from 3.6 to 4.0 to 4.1?
What I should do after changing them to make them active?

Thanks in advance,
Gianluca
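The throughput figure quoted above checks out (taking 90 GB as 90 x 1024 MB written over the 16-minute clone window):

```python
size_mb = 90 * 1024   # 90 GB preallocated disk, in MB
seconds = 16 * 60     # clone ran from 10:24 to 10:40

print(round(size_mb / seconds))  # 96 (MB/s)
```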
Re: [ovirt-users] VM has been paused due to storage I/O problem
On Thu, Feb 2, 2017 at 12:04 PM, Gianluca Cecchi wrote:
>
> On Thu, Feb 2, 2017 at 10:48 AM, Nir Soffer wrote:
>> On Thu, Feb 2, 2017 at 1:11 AM, Gianluca Cecchi wrote:
>> > On Wed, Feb 1, 2017 at 8:22 PM, Gianluca Cecchi wrote:
>> >>
>> >> OK. In the mean time I have applied your suggested config and
>> >> restarted the 2 nodes.
>> >> Let me test and see if I find any problems running also some I/O tests.
>> >> Thanks in the mean time,
>> >> Gianluca
>> >
>> > Quick test without much success
>> >
>> > Inside the guest I run this loop
>> > while true
>> > do
>> > time dd if=/dev/zero bs=1024k count=1024 of=/home/g.cecchi/testfile
>> > sleep 5
>> > done
>>
>> I don't think this test is related to the issues you reported earlier.
>
> I thought the same too, and all related comments you wrote.
> I'm going to test the suggested modifications for chunks.
> In general do you recommend thin provisioning at all on SAN storage?

Only if your storage does not support thin provisioning, or you need
snapshot support. If you don't need these features, using raw will be much
more reliable and faster.

Even if you use raw, you can still perform live storage migration; we create
a snapshot using qcow2 format, copy the base raw volume to another storage,
and finally delete the snapshot on the destination storage.

In the future (ovirt 5?) we would like to use only smart storage thin
provisioning and snapshot support.

> I decided to switch to preallocated for further tests and confirm.
> So I created a snapshot and then a clone of the VM, changing the
> allocation policy of the disk to preallocated.
> So far so good.
>
> Feb 2, 2017 10:40:23 AM VM ol65preallocated creation has been completed.
> Feb 2, 2017 10:24:15 AM VM ol65preallocated creation was initiated by
> admin@internal-authz.
> Feb 2, 2017 10:22:31 AM Snapshot 'for cloning' creation for VM 'ol65' has
> been completed.
> Feb 2, 2017 10:22:31 AM Snapshot 'for cloning' creation for VM 'ol65' was
> initiated by admin@internal-authz.
>
> so the throughput seems ok based on this storage type (the LUNs are on
> RAID5 made with sata disks): 16 minutes to write 90Gb is about
> 96MBytes/s, as expected
>
> What I see in messages during the cloning phase from 10:24 to 10:40:
>
> Feb 2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb 2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb 2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb 2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb 2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb 2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb 2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb 2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb 2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb 2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb 2 10:24:14 ovmsrv05 journal: vdsm root WARNING File:
> /rhev/data-center/588237b8-0031-02f6-035d-0136/922b5269-ab56-4c4d-838f-49d33427e2ab/images/9d1c977f-540d-436a-9d93-b1cb0816af2a/607dbf59-7d4d-4fc3-ae5f-e8824bf82648
> already removed
> Feb 2 10:24:14 ovmsrv05 multipathd: dm-15: remove map (uevent)
> Feb 2 10:24:14 ovmsrv05 multipathd: dm-15: devmap not registered, can't
> remove
> Feb 2 10:24:14 ovmsrv05 multipathd: dm-15: remove map (uevent)
> Feb 2 10:24:17 ovmsrv05 kernel: blk_update_request: critical target error,
> dev dm-4, sector 44566529
> Feb 2 10:24:17 ovmsrv05 kernel: dm-15: WRITE SAME failed. Manually zeroing.
> Feb 2 10:40:07 ovmsrv05 kernel: scsi_verify_blk_ioctl: 16 callbacks
> suppressed
> Feb 2 10:40:07 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb 2 10:40:17 ovmsrv05 multipathd: dm-15: remove map (uevent)
> Feb 2 10:40:17 ovmsrv05 multipathd: dm-15: devmap not registered, can't
> remove
> Feb 2 10:40:17 ovmsrv05 multipathd: dm-15: remove map (uevent)
> Feb 2 10:40:22 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!

Let's file a bug to investigate these "kernel: dd: sending ioctl 80306d02 to
a partition!" messages. Please attach the vdsm log from the machine emitting
these messages.

>> > After about 7 rounds I get this in messages of the host where the VM is
>> > running:
>> >
>> > Feb 1 23:31:39 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
>> > Feb 1 23:31:39 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
>> > Feb 1 23:31:44 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
>> > Feb 1 23:31:45 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
>> > Feb 1 23:31:45 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
>> > Feb 1 23:31:47 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
>> > Feb 1 23:31:50 ovmsrv06 kernel: dd: sending
Re: [ovirt-users] VM has been paused due to storage I/O problem
On Thu, Feb 2, 2017 at 12:04 PM, Gianluca Cecchi wrote: > > > On Thu, Feb 2, 2017 at 10:48 AM, Nir Soffer wrote: > >> On Thu, Feb 2, 2017 at 1:11 AM, Gianluca Cecchi >> wrote: >> > On Wed, Feb 1, 2017 at 8:22 PM, Gianluca Cecchi < >> gianluca.cec...@gmail.com> >> > wrote: >> >> >> >> >> >> OK. In the mean time I have applied your suggested config and restarted >> >> the 2 nodes. >> >> Let we test and see if I find any problems running also some I/O tests. >> >> Thanks in the mean time, >> >> Gianluca >> > >> > >> > >> > Quick test without much success >> > >> > Inside the guest I run this loop >> > while true >> > do >> > time dd if=/dev/zero bs=1024k count=1024 of=/home/g.cecchi/testfile >> > A single 'dd' rarely saturates a high performance storage. There are better utilities to test ('fio' , 'vdbench' and 'ddpt' for example). It's also testing a very theoretical scenario - very rarely you write zeros and very rarely you write so much sequential IO, and with a fixed block size. So it's almost 'hero numbers'. > sleep 5 >> > done >> >> I don't think this test is related to the issues you reported earlier. >> >> > I thought the same too, and all related comments you wrote. > I'm going to test the suggested modifications for chunks. > In general do you recommend thin provisioning at all on SAN storage? > Depends on your SAN. On thin provisioned one (with potentially inline dedup and compression, such as XtremIO, Pure, Nimble and others) I don't see a great value in thin provisioning. > > I decided to switch to preallocated for further tests and confirm > So I created a snapshot and then a clone of the VM, changing allocation > policy of the disk to preallocated. > So far so good. > > Feb 2, 2017 10:40:23 AM VM ol65preallocated creation has been completed. > Feb 2, 2017 10:24:15 AM VM ol65preallocated creation was initiated by > admin@internal-authz. > Feb 2, 2017 10:22:31 AM Snapshot 'for cloning' creation for VM 'ol65' has > been completed. 
> Feb 2, 2017 10:22:31 AM Snapshot 'for cloning' creation for VM 'ol65' was > initiated by admin@internal-authz. > > so the throughput seems ok based on this storage type (the LUNs are on > RAID5 made with sata disks): 16 minutes to write 90Gb is about 96MBytes/s, > what expected > What is your expectation? Is it FC, iSCSI? How many paths? What is the IO scheduler in the VM? Is it using virtio-blk or virtio-SCSI? Y. > > What I see in messages during the cloning phasefrom 10:24 to 10:40: > > Feb 2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition! > Feb 2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition! > Feb 2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition! > Feb 2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition! > Feb 2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition! > Feb 2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition! > Feb 2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition! > Feb 2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition! > Feb 2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition! > Feb 2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition! > Feb 2 10:24:14 ovmsrv05 journal: vdsm root WARNING File: > /rhev/data-center/588237b8-0031-02f6-035d-0136/ > 922b5269-ab56-4c4d-838f-49d33427e2ab/images/9d1c977f- > 540d-436a-9d93-b1cb0816af2a/607dbf59-7d4d-4fc3-ae5f-e8824bf82648 already > removed > Feb 2 10:24:14 ovmsrv05 multipathd: dm-15: remove map (uevent) > Feb 2 10:24:14 ovmsrv05 multipathd: dm-15: devmap not registered, can't > remove > Feb 2 10:24:14 ovmsrv05 multipathd: dm-15: remove map (uevent) > Feb 2 10:24:17 ovmsrv05 kernel: blk_update_request: critical target > error, dev dm-4, sector 44566529 > Feb 2 10:24:17 ovmsrv05 kernel: dm-15: WRITE SAME failed. Manually > zeroing. 
> Feb 2 10:40:07 ovmsrv05 kernel: scsi_verify_blk_ioctl: 16 callbacks > suppressed > Feb 2 10:40:07 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition! > Feb 2 10:40:17 ovmsrv05 multipathd: dm-15: remove map (uevent) > Feb 2 10:40:17 ovmsrv05 multipathd: dm-15: devmap not registered, can't > remove > Feb 2 10:40:17 ovmsrv05 multipathd: dm-15: remove map (uevent) > Feb 2 10:40:22 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition! > > > > >> > After about 7 rounds I get this in messages of the host where the VM is >> > running: >> > >> > Feb 1 23:31:39 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a >> partition! >> > Feb 1 23:31:39 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a >> partition! >> > Feb 1 23:31:44 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a >> partition! >> > Feb 1 23:31:45 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a >> partition! >> > Feb 1 23:31:45 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a >> partition! >> > Feb 1 23:31:47 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a >> partition! >> > Feb 1 2
Re: [ovirt-users] VM has been paused due to storage I/O problem
On Thu, Feb 2, 2017 at 10:48 AM, Nir Soffer wrote: > On Thu, Feb 2, 2017 at 1:11 AM, Gianluca Cecchi > wrote: > > On Wed, Feb 1, 2017 at 8:22 PM, Gianluca Cecchi < > gianluca.cec...@gmail.com> > > wrote: > >> > >> > >> OK. In the mean time I have applied your suggested config and restarted > >> the 2 nodes. > >> Let's test and see if I find any problems running also some I/O tests. > >> Thanks in the mean time, > >> Gianluca > > > > > > > > Quick test without much success > > > > Inside the guest I run this loop > > while true > > do > > time dd if=/dev/zero bs=1024k count=1024 of=/home/g.cecchi/testfile > > sleep 5 > > done > > I don't think this test is related to the issues you reported earlier. > > I thought the same too, and all related comments you wrote. I'm going to test the suggested modifications for chunks. In general do you recommend thin provisioning at all on SAN storage? I decided to switch to preallocated for further tests and confirm. So I created a snapshot and then a clone of the VM, changing the allocation policy of the disk to preallocated. So far so good. Feb 2, 2017 10:40:23 AM VM ol65preallocated creation has been completed. Feb 2, 2017 10:24:15 AM VM ol65preallocated creation was initiated by admin@internal-authz. Feb 2, 2017 10:22:31 AM Snapshot 'for cloning' creation for VM 'ol65' has been completed. Feb 2, 2017 10:22:31 AM Snapshot 'for cloning' creation for VM 'ol65' was initiated by admin@internal-authz. so the throughput seems ok based on this storage type (the LUNs are on RAID5 made with sata disks): 16 minutes to write 90GB is about 96MBytes/s, which is what I expected What I see in messages during the cloning phase from 10:24 to 10:40: Feb 2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition! Feb 2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition! Feb 2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition! Feb 2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition! 
Feb 2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition! Feb 2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition! Feb 2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition! Feb 2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition! Feb 2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition! Feb 2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition! Feb 2 10:24:14 ovmsrv05 journal: vdsm root WARNING File: /rhev/data-center/588237b8-0031-02f6-035d-0136/922b5269-ab56-4c4d-838f-49d33427e2ab/images/9d1c977f-540d-436a-9d93-b1cb0816af2a/607dbf59-7d4d-4fc3-ae5f-e8824bf82648 already removed Feb 2 10:24:14 ovmsrv05 multipathd: dm-15: remove map (uevent) Feb 2 10:24:14 ovmsrv05 multipathd: dm-15: devmap not registered, can't remove Feb 2 10:24:14 ovmsrv05 multipathd: dm-15: remove map (uevent) Feb 2 10:24:17 ovmsrv05 kernel: blk_update_request: critical target error, dev dm-4, sector 44566529 Feb 2 10:24:17 ovmsrv05 kernel: dm-15: WRITE SAME failed. Manually zeroing. Feb 2 10:40:07 ovmsrv05 kernel: scsi_verify_blk_ioctl: 16 callbacks suppressed Feb 2 10:40:07 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition! Feb 2 10:40:17 ovmsrv05 multipathd: dm-15: remove map (uevent) Feb 2 10:40:17 ovmsrv05 multipathd: dm-15: devmap not registered, can't remove Feb 2 10:40:17 ovmsrv05 multipathd: dm-15: remove map (uevent) Feb 2 10:40:22 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition! > > After about 7 rounds I get this in messages of the host where the VM is > > running: > > > > Feb 1 23:31:39 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a > partition! > > Feb 1 23:31:39 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a > partition! > > Feb 1 23:31:44 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a > partition! > > Feb 1 23:31:45 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a > partition! 
> > Feb 1 23:31:45 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a > partition! > > Feb 1 23:31:47 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a > partition! > > Feb 1 23:31:50 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a > partition! > > Feb 1 23:31:50 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a > partition! > > Feb 1 23:31:56 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a > partition! > > Feb 1 23:31:57 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a > partition! > > This is interesting, we have seen this messages before, but could never > detect the flow causing them, are you sure you see this each time you > extend > your disk? > > If you can reproduce this, please file a bug. > > Ok, see also above the registered message during the clone phase. Gianluca ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
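[Editor's note, not part of the original thread: the throughput figure quoted in the message above (90GB written in 16 minutes, "about 96MBytes/s") can be sanity-checked with simple shell arithmetic.]

```shell
# Back-of-the-envelope check of the figure quoted in the mail above:
# 90 GB written in 16 minutes.
size_mb=$((90 * 1024))     # 90 GB expressed in MB
secs=$((16 * 60))          # 16 minutes in seconds
rate=$((size_mb / secs))   # integer MB/s
echo "${rate} MB/s"        # prints: 96 MB/s, matching the mail
```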
Re: [ovirt-users] VM has been paused due to storage I/O problem
On Thu, Feb 2, 2017 at 1:11 AM, Gianluca Cecchi wrote: > On Wed, Feb 1, 2017 at 8:22 PM, Gianluca Cecchi > wrote: >> >> >> OK. In the mean time I have applied your suggested config and restarted >> the 2 nodes. >> Let's test and see if I find any problems running also some I/O tests. >> Thanks in the mean time, >> Gianluca > > > > Quick test without much success > > Inside the guest I run this loop > while true > do > time dd if=/dev/zero bs=1024k count=1024 of=/home/g.cecchi/testfile > sleep 5 > done I don't think this test is related to the issues you reported earlier. What you test here is how fast ovirt thin provisioning can extend your disk when writing zeros. We don't handle that very well; each extend needs about 4-6 seconds to complete before we refresh the lv on the host running the vm, and this is the *best* case. In the worst case it can take much longer. Also what you tested here is how fast you can write to your vm buffer cache, since you are not using direct io. A better way to perform this test is: time dd if=/dev/zero bs=1024k count=1024 of=/home/g.cecchi/testfile oflag=direct This will give you the time to actually write data to storage. If you have a real issue with vms pausing during writes when the vm disk has to be extended, you can enlarge the extend chunk, 1GiB by default. 
To use chunks of 2GiB, set: [irs] volume_utilization_percent = 50 volume_utilization_chunk_mb = 2048 This will extend the drive when free space is less than 1024MiB (volume_utilization_chunk_mb * (100 - volume_utilization_percent) / 100) If this is not enough, you can also use a lower volume_utilization_percent; for example, this will extend the disk in 2GiB chunks when free space is below 1536MiB: [irs] volume_utilization_percent = 25 volume_utilization_chunk_mb = 2048 > BTW: my home is inside / filesystem on guest that has space to accommodate > 1Gb of the dd command: > [g.cecchi@ol65 ~]$ df -h /home/g.cecchi/ > Filesystem Size Used Avail Use% Mounted on > /dev/mapper/vg_ol65-lv_root > 20G 4.9G 14G 27% / > [g.cecchi@ol65 ~]$ > > After about 7 rounds I get this in messages of the host where the VM is > running: > > Feb 1 23:31:39 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition! > Feb 1 23:31:39 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition! > Feb 1 23:31:44 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition! > Feb 1 23:31:45 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition! > Feb 1 23:31:45 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition! > Feb 1 23:31:47 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition! > Feb 1 23:31:50 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition! > Feb 1 23:31:50 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition! > Feb 1 23:31:56 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition! > Feb 1 23:31:57 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition! This is interesting, we have seen these messages before, but could never detect the flow causing them; are you sure you see this each time you extend your disk? If you can reproduce this, please file a bug. > > Nothing on the other host. > In web admin events pane: > Feb 1, 2017 11:31:44 PM VM ol65 has been paused due to no Storage space > error. > Feb 1, 2017 11:31:44 PM VM ol65 has been paused. 
> > I stop the dd loop and after some seconds: > Feb 1, 2017 11:32:32 PM VM ol65 has recovered from paused back to up > > Multipath status for my device: > > 3600a0b8000299aa8d08b55014119 dm-2 IBM ,1814 FAStT > size=4.0T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 > rdac' wp=rw > |-+- policy='service-time 0' prio=0 status=active > | |- 0:0:1:3 sdj 8:144 active undef running > | `- 2:0:1:3 sdp 8:240 active undef running > `-+- policy='service-time 0' prio=0 status=enabled > |- 0:0:0:3 sdd 8:48 active undef running > `- 2:0:0:3 sdk 8:160 active undef running > > In engine.log > > 2017-02-01 23:22:01,449 INFO > [org.ovirt.engine.core.bll.aaa.CreateUserSessionCommand] (default task-15) > [530aee87] Running command: CreateUserSessionCommand internal: false. > 2017-02-01 23:22:04,011 INFO > [org.ovirt.engine.docs.utils.servlet.ContextSensitiveHelpMappingServlet] > (default task-12) [] Context-sensitive help is not installed. Manual > directory doesn't exist: /usr/share/ovirt-engine/manual > 2017-02-01 23:31:43,936 INFO > [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] > (ForkJoinPool-1-worker-10) [10c15e39] VM > '932db7c7-4121-4cbe-ad8d-09e4e99b3cdd'(ol65) moved from 'Up' --> 'Paused' > 2017-02-01 23:31:44,087 INFO > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] > (ForkJoinPool-1-worker-10) [10c15e39] Correlation ID: null, Call Stack: > null, Custom Event ID: -1, Message: VM ol65 has been paused. > 2017-02-01 23:31:44,227 ERROR > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] > (ForkJoinPool-1-worker-10) [10c
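[Editor's note, not part of the original thread: the extension-threshold formula quoted in the message above can be checked with a small shell function. `threshold_mb` is a hypothetical helper named here for illustration only; it is not part of vdsm.]

```shell
# Hypothetical helper reproducing the formula quoted in the mail above:
#   extend when free space < volume_utilization_chunk_mb
#                            * (100 - volume_utilization_percent) / 100
threshold_mb() {
  local percent=$1 chunk_mb=$2
  echo $(( chunk_mb * (100 - percent) / 100 ))
}
threshold_mb 50 2048   # prints 1024 (MiB), the first example in the mail
threshold_mb 25 2048   # prints 1536 (MiB), the second example
```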
Re: [ovirt-users] VM has been paused due to storage I/O problem
On Wed, Feb 1, 2017 at 8:22 PM, Gianluca Cecchi wrote: > > OK. In the mean time I have applied your suggested config and restarted > the 2 nodes. > Let's test and see if I find any problems running also some I/O tests. > Thanks in the mean time, > Gianluca > Quick test without much success Inside the guest I run this loop while true do time dd if=/dev/zero bs=1024k count=1024 of=/home/g.cecchi/testfile sleep 5 done BTW: my home is inside / filesystem on guest that has space to accommodate 1Gb of the dd command: [g.cecchi@ol65 ~]$ df -h /home/g.cecchi/ Filesystem Size Used Avail Use% Mounted on /dev/mapper/vg_ol65-lv_root 20G 4.9G 14G 27% / [g.cecchi@ol65 ~]$ After about 7 rounds I get this in messages of the host where the VM is running: Feb 1 23:31:39 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition! Feb 1 23:31:39 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition! Feb 1 23:31:44 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition! Feb 1 23:31:45 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition! Feb 1 23:31:45 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition! Feb 1 23:31:47 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition! Feb 1 23:31:50 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition! Feb 1 23:31:50 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition! Feb 1 23:31:56 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition! Feb 1 23:31:57 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition! Nothing on the other host. In web admin events pane: Feb 1, 2017 11:31:44 PM VM ol65 has been paused due to no Storage space error. Feb 1, 2017 11:31:44 PM VM ol65 has been paused. 
I stop the dd loop and after some seconds: Feb 1, 2017 11:32:32 PM VM ol65 has recovered from paused back to up Multipath status for my device: 3600a0b8000299aa8d08b55014119 dm-2 IBM ,1814 FAStT size=4.0T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw |-+- policy='service-time 0' prio=0 status=active | |- 0:0:1:3 sdj 8:144 active undef running | `- 2:0:1:3 sdp 8:240 active undef running `-+- policy='service-time 0' prio=0 status=enabled |- 0:0:0:3 sdd 8:48 active undef running `- 2:0:0:3 sdk 8:160 active undef running In engine.log 2017-02-01 23:22:01,449 INFO [org.ovirt.engine.core.bll.aaa.CreateUserSessionCommand] (default task-15) [530aee87] Running command: CreateUserSessionCommand internal: false. 2017-02-01 23:22:04,011 INFO [org.ovirt.engine.docs.utils.servlet.ContextSensitiveHelpMappingServlet] (default task-12) [] Context-sensitive help is not installed. Manual directory doesn't exist: /usr/share/ovirt-engine/manual 2017-02-01 23:31:43,936 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-10) [10c15e39] VM '932db7c7-4121-4cbe-ad8d-09e4e99b3cdd'(ol65) moved from 'Up' --> 'Paused' 2017-02-01 23:31:44,087 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-10) [10c15e39] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM ol65 has been paused. 2017-02-01 23:31:44,227 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-10) [10c15e39] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM ol65 has been paused due to no Storage space error. 
2017-02-01 23:32:32,100 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler5) [1c02a371] VM '932db7c7-4121-4cbe-ad8d-09e4e99b3cdd'(ol65) moved from 'Paused' --> 'Up' 2017-02-01 23:32:32,259 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler5) [1c02a371] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM ol65 has recovered from paused back to up. In vdsm.log of the host where the VM is running I see this: mailbox.SPMMonitor::DEBUG::2017-02-01 23:31:37,729::storage_mailbox: :733::Storage.Misc.excCmd::(_chec kForMail) /usr/bin/taskset --cpu-list 0-7 dd if=/rhev/data-center/588237b8- 0031-02f6-035d-013 6/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None) mailbox.SPMMonitor::DEBUG::2017-02-01 23:31:37,764::storage_mailbox: :733::Storage.Misc.excCmd::(_chec kForMail) SUCCESS: = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.0061866 s, 166 MB/s\n'; = 0 periodic/8::INFO::2017-02-01 23:31:39,699::vm::1028::virt.vm::(extendDrivesIfNeeded) vmId=`932db7c7-4 121-4cbe-ad8d-09e4e99b3cdd`::Requesting extension for volume 66615920-619f-4f52-ad5a-3f062c054094 on domain 922b5269-ab56-4c4d-838f-49d33427e2ab (apparent: 12884901888, capacity: 96636764160, allocated: 12355698688, physical: 12884901888) periodic/8::DEBUG::2017-02-01 23:31:39,699::vm::1093::virt.vm::(__extendDriveVolume) vmId=`932db7c7-4 121-4cbe-ad8d-09e4e99b3cdd`::Requesting an extension for the volume: {'newSize': 13958643712, 'domain ID': u'922
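[Editor's note, not part of the original thread: the extendDrivesIfNeeded entry in the vdsm log above requests growth from a physical size of 12884901888 bytes to a newSize of 13958643712 bytes. A quick shell check confirms this difference matches the default 1GiB extension chunk discussed elsewhere in the thread.]

```shell
# Difference between the 'newSize' and the current 'physical' size
# taken from the vdsm log lines quoted above.
old=12884901888            # physical size before the extension request
new=13958643712            # requested newSize
chunk=$(( new - old ))
echo "$(( chunk / 1024 / 1024 )) MiB"   # prints: 1024 MiB (the 1GiB default)
```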
Re: [ovirt-users] VM has been paused due to storage I/O problem
I'm also seeing this error using a Dell MD3800i array. The multipath errors shown in our logs are different however. Feb 1 15:11:58 ovirt-node-production2 kernel: dd: sending ioctl 80306d02 to a partition! Feb 1 15:21:01 ovirt-node-production2 multipathd: dm-31: remove map (uevent) Feb 1 15:21:01 ovirt-node-production2 multipathd: dm-31: devmap not registered, can't remove Feb 1 15:21:01 ovirt-node-production2 multipathd: dm-31: remove map (uevent) The dd error seems to happen every time that SPM runs a test. On 01/31/2017 09:23 AM, Nathanaël Blanchet wrote: > > exactly the same issue by there with FC EMC domain storage... > > > Le 31/01/2017 à 15:20, Gianluca Cecchi a écrit : >> Hello, >> my test environment is composed by 2 old HP blades BL685c G1 >> (ovmsrv05 and ovmsrv06) and they are connected in a SAN with >> FC-switches to an old IBM DS4700 storage array. >> Apart from being old, they seem all ok from an hw point of view. >> I have configured oVirt 4.0.6 and an FCP storage domain. >> The hosts are plain CentOS 7.3 servers fully updated. >> It is not an hosted engine environment: the manager is a vm outside >> of the cluster. >> I have configured power mgmt on both and it works good. >> >> I have at the moment only one VM for test and it is doing quite nothing. >> >> Starting point: ovmsrv05 is in maintenance (since about 2 days) and >> the VM is running on ovmsrv06. >> I update qemu-kvm package on ovmsrv05 and then I restart it from web >> admin gui: >> Power Mgmt --> Restart >> >> Sequence of events in pane and the problem in subject: >> Jan 31, 2017 10:29:43 AM Host ovmsrv05 power management was verified >> successfully. >> Jan 31, 2017 10:29:43 AM Status of host ovmsrv05 was set to Up. >> Jan 31, 2017 10:29:38 AM Executing power management status on Host >> ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212. >> Jan 31, 2017 10:29:29 AM Activation of host ovmsrv05 initiated by >> admin@internal-authz. 
>> Jan 31, 2017 10:28:05 AM VM ol65 has recovered from paused back to up. >> Jan 31, 2017 10:27:55 AM VM ol65 has been paused due to storage I/O >> problem. >> Jan 31, 2017 10:27:55 AM VM ol65 has been paused. >> Jan 31, 2017 10:25:52 AM Host ovmsrv05 was restarted by >> admin@internal-authz. >> Jan 31, 2017 10:25:52 AM Host ovmsrv05 was started by >> admin@internal-authz. >> Jan 31, 2017 10:25:52 AM Power management start of Host ovmsrv05 >> succeeded. >> Jan 31, 2017 10:25:50 AM Executing power management status on Host >> ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212. >> Jan 31, 2017 10:25:37 AM Executing power management start on Host >> ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212. >> Jan 31, 2017 10:25:37 AM Power management start of Host ovmsrv05 >> initiated. >> Jan 31, 2017 10:25:37 AM Auto fence for host ovmsrv05 was started. >> Jan 31, 2017 10:25:37 AM All VMs' status on Non Responsive Host >> ovmsrv05 were changed to 'Down' by admin@internal-authz >> Jan 31, 2017 10:25:36 AM Host ovmsrv05 was stopped by >> admin@internal-authz. >> Jan 31, 2017 10:25:36 AM Power management stop of Host ovmsrv05 >> succeeded. >> Jan 31, 2017 10:25:34 AM Executing power management status on Host >> ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212. >> Jan 31, 2017 10:25:15 AM Executing power management stop on Host >> ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212. >> Jan 31, 2017 10:25:15 AM Power management stop of Host ovmsrv05 >> initiated. >> Jan 31, 2017 10:25:12 AM Executing power management status on Host >> ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212. 
>> >> Watching the timestamps, the culprit seems the reboot time of >> ovmsrv05 that detects some LUNs in owned state and other ones in unowned >> Full messages of both hosts here: >> https://drive.google.com/file/d/0BwoPbcrMv8mvekZQT1pjc0NMRlU/view?usp=sharing >> and >> https://drive.google.com/file/d/0BwoPbcrMv8mvcjBCYVdFZWdXTms/view?usp=sharing >> >> At this time there are 4 LUNs globally seen by the two hosts but only >> 1 of them is currently configured as the only storage domain in oVirt >> cluster. >> >> [root@ovmsrv05 ~]# multipath -l | grep ^36 >> 3600a0b8000299aa8d08b55014119 dm-5 IBM ,1814 FAStT >> 3600a0b8000299902cd3c5501458f dm-3 IBM ,1814 FAStT >> 3600a0b8000299902ccf855011198 dm-2 IBM ,1814 FAStT >> 3600a0b8000299aa8d08955014098 dm-4 IBM ,1814 FAStT >> >> the configured one: >> [root@ovmsrv05 ~]# multipath -l 3600a0b8000299aa8d08b55014119 >> 3600a0b8000299aa8d08b55014119 dm-5 IBM ,1814 FAStT >> size=4.0T features='0' hwhandler='1 rdac' wp=rw >> |-+- policy='service-time 0' prio=0 status=active >> | |- 0:0:1:3 sdl 8:176 active undef running >> | `- 2:0:1:3 sdp 8:240 active undef running >> `-+- policy='service-time 0' prio=0 status=enabled >> |- 0:0:0:3 sdd 8:48 active undef running >> `- 2:0:0:3 sdi 8:128 active undef running
Re: [ovirt-users] VM has been paused due to storage I/O problem
On Wed, Feb 1, 2017 at 8:39 AM, Nir Soffer wrote: > > > Hi Gianluca, > > This should be a number, not a string, maybe multipath is having trouble > parsing this and it ignores your value? > I don't think so. Also because reading dm multipath guide at https://access.redhat.com/documentation/en-US/Red_Hat_Enterp rise_Linux/7/html/DM_Multipath/multipath_config_confirm.html It seems that in RH EL 7.3 the "show config" command has this behaviour: " For example, the following command sequence displays the multipath configuration, including the defaults, before exiting the console. # multipathd -k > > show config > > CTRL-D " So the output has to include the default too. Anyway I changed it, see below < BEGIN OF PARENTHESIS In theory it should be the same on RH EL 6.8 (see https://access.redhat.com/documentation/en-US/Red_Hat_Enterp rise_Linux/6/html/DM_Multipath/multipath_config_confirm.html ) but it is not so for me on a system that is on 6.5, with device-mapper-multipath-0.4.9-93.el6.x86_64 and connected to Netapp In /usr/share/doc/device-mapper-multipath-0.4.9/multipath.conf.defaults # vendor "NETAPP" # product "LUN.*" # path_grouping_policy group_by_prio # getuid_callout "/lib/udev/scsi_id --whitelisted --device=/dev/%n" # path_selector "round-robin 0" # path_checker tur # features "3 queue_if_no_path pg_init_retries 50" # hardware_handler "0" # prio ontap # failback immediate # rr_weight uniform # rr_min_io 128 # rr_min_io_rq 1 # flush_on_last_del yes # fast_io_fail_tmo 5 # dev_loss_tmo infinity # retain_attached_hw_handler yes # detect_prio yes # reload_readwrite yes # } My customization in multipath.conf, based on Netapp guidelines and my Netapp storage array setup: devices { device { vendor "NETAPP" product "LUN.*" getuid_callout "/lib/udev/scsi_id -g -u -d /dev/%n" hardware_handler "1 alua" prio alua } If I run "multipathd show config" I see only 1 entry for NETAPP/LUN vendor/product and it is a merge of default and my custom. 
device { vendor "NETAPP" product "LUN.*" path_grouping_policy group_by_prio getuid_callout "/lib/udev/scsi_id -g -u -d /dev/%n" path_selector "round-robin 0" path_checker tur features "3 queue_if_no_path pg_init_retries 50" hardware_handler "1 alua" prio alua failback immediate rr_weight uniform rr_min_io 128 rr_min_io_rq 1 flush_on_last_del yes fast_io_fail_tmo 5 dev_loss_tmo infinity retain_attached_hw_handler yes detect_prio yes reload_readwrite yes } So this difference confused me when configuring multipath in CentOS 7.3. I have to see when I'm going to update from 6.5 to 6.8 if this changes. < END OF PARENTHESIS > > } > > } > > > > So I put exactly the default device config for my IBM/1814 device but > > no_path_retry set to 12. > > Why 12? > > This will do 12 retries, 5 seconds each when no path is available. This > will > block lvm commands for 60 seconds when no path is available, blocking > other stuff in vdsm. Vdsm is not designed to handle this. > > I recommend value of 4. > OK. > But note that this will is not related to the fact that your devices are > not > initialize properly after boot. > In fact it could be also a ds4700 overall problem The LUNs are configured as LNX CLUSTER type, that should be ok in theory, even if this kind of storage was never so supported with Linux. Initially one had to use proprietary IBM kernel modules/drivers. I will see consistency and robustness through testing. I have to do a POC and this is the hw I have and I should at least try to have a working solution for it. > > > In CentOS 6.x when you do something like this, "show config" gives you > the > > modified entry only for your device section. > > Instead in CentOS 7.3 it seems I get anyway the default one for IBM/1814 > and > > also the customized one at the end of the output > > Maybe your device configuration does not match exactly the builtin config. > I think it is the different behaviour as outlined above. 
I think you can confirm in another system where some customization has been done too... > > > Maybe waiting a moment helps the storage/switches to clean up > properly after a server is shut down? > I think so too. Eventually when possible, if errors repeat with the new config, I'll manage to do stop/start instead of restart > > Does your power manageme
Re: [ovirt-users] VM has been paused due to storage I/O problem
On Tue, Jan 31, 2017 at 6:09 PM, Gianluca Cecchi wrote: > On Tue, Jan 31, 2017 at 3:23 PM, Nathanaël Blanchet > wrote: >> >> exactly the same issue by there with FC EMC domain storage... >> >> > > I'm trying to mitigate this by inserting a timeout for my SAN devices but I'm not > sure of its effectiveness as CentOS 7 behavior of "multipathd -k" and then > "show config" seems different from CentOS 6.x > In fact my attempt for multipath.conf is this > > > # VDSM REVISION 1.3 > # VDSM PRIVATE > > defaults { > polling_interval 5 > no_path_retry fail > user_friendly_names no > flush_on_last_del yes > fast_io_fail_tmo 5 > dev_loss_tmo 30 > max_fds 4096 > } > > # Remove devices entries when overrides section is available. > devices { > device { > # These settings overrides built-in devices settings. It does not > apply > # to devices without built-in settings (these use the settings in > the > # "defaults" section), or to devices defined in the "devices" > section. > # Note: This is not available yet on Fedora 21. For more info see > # https://bugzilla.redhat.com/1253799 > all_devs yes > no_path_retry fail > } > device { > vendor "IBM" > product "^1814" > product_blacklist "Universal Xport" > path_grouping_policy "group_by_prio" > path_checker "rdac" > features "0" > hardware_handler "1 rdac" > prio "rdac" > failback immediate > rr_weight "uniform" > no_path_retry "12" Hi Gianluca, This should be a number, not a string; maybe multipath is having trouble parsing this and it ignores your value? > } > } > > So I put exactly the default device config for my IBM/1814 device but > no_path_retry set to 12. Why 12? This will do 12 retries, 5 seconds each when no path is available. This will block lvm commands for 60 seconds when no path is available, blocking other stuff in vdsm. Vdsm is not designed to handle this. I recommend a value of 4. But note that this is not related to the fact that your devices are not initialized properly after boot. 
> In CentOS 6.x when you do something like this, "show config" gives you the > modified entry only for your device section. > Instead in CentOS 7.3 it seems I get anyway the default one for IBM/1814 and > also the customized one at the end of the output Maybe your device configuration does not match exactly the builtin config. > > Two facts: > - before I could reproduce the problem if I selected > Maintenance > Power Mgmt ---> Restart > (tried 3 times with same behavior) > > Instead if I executed in separate steps > Maintenance > Power Mgmt --> Stop > wait a moment > Power Mgmt --> Start > > I didn't get problems (tried only one time...) Maybe waiting a moment helps the storage/switches to clean up properly after a server is shut down? Does your power management trigger a proper shutdown? I would avoid using it for normal shutdown. > > With this "new" multipath config (to be confirmed if in effect, how?) I > don't get the VM paused problem even with Restart option of Power Mgmt > In active host messages I see these ones when the other reboots: > > Jan 31 16:50:01 ovmsrv06 systemd: Started Session 705 of user root. > Jan 31 16:50:01 ovmsrv06 systemd: Starting Session 705 of user root. 
> Jan 31 16:53:47 ovmsrv06 multipathd: 3600a0b8000299aa8d08955014098: sde - rdac checker reports path is up
> Jan 31 16:53:47 ovmsrv06 multipathd: 8:64: reinstated
> Jan 31 16:53:47 ovmsrv06 multipathd: 3600a0b8000299aa8d08955014098: load table [0 41943040 multipath 1 queue_if_no_path 1 rdac 2 1 service-time 0 2 1 8:224 1 65:0 1 service-time 0 2 1 8:64 1 8:160 1]
> Jan 31 16:53:47 ovmsrv06 multipathd: 3600a0b8000299aa8d08955014098: sdo - rdac checker reports path is ghost
> Jan 31 16:53:47 ovmsrv06 multipathd: 8:224: reinstated
> Jan 31 16:53:47 ovmsrv06 multipathd: 3600a0b8000299aa8d08955014098: sdk - rdac checker reports path is up
> Jan 31 16:53:47 ovmsrv06 multipathd: 8:160: reinstated
> Jan 31 16:53:47 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_DS4700, ctlr 1, queueing MODE_SELECT command
> Jan 31 16:53:47 ovmsrv06 multipathd: 3600a0b8000299aa8d08955014098: sdq - rdac checker reports path is ghost
> Jan 31 16:53:47 ovmsrv06 multipathd: 65:0: reinstated
> Jan 31 16:53:48 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_DS4700, ctlr 1, MODE_SELECT returned with sense 05/91/36
> Jan 31 16:53:48 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_DS4700, ctlr 1, queueing MODE_SELECT command
> Jan 31 16:53:49 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_DS4700, ctlr 1, MODE_SELECT returned with sense 05/91/36
> Jan 31 16:53:49 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_
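Ben Marzinski's reply later in this thread explains that RHEL 7 uses regex matching both for applying a device config to a device and for merging a user-supplied device section with a builtin one, where the user's product value (treated as a plain string) must regex-match the builtin regex. A rough sketch of that pitfall with grep (the product strings are illustrative):

```shell
# The builtin config for this array stores product as the regex "^1814",
# while a real device reports a literal product string like "1814  FAStT".
builtin_regex='^1814'

# Applying a config to a device: the device's product string is matched
# against the builtin regex -- this succeeds.
if printf '%s\n' '1814      FAStT' | grep -q "$builtin_regex"; then
    echo 'device matches the builtin config'
fi

# Merging (RHEL 7): the user-supplied product value "^1814" is treated as
# a plain string that must regex-match the builtin regex "^1814". The
# string starts with "^", not "1", so the anchored regex fails and the
# user section is not merged with the builtin one.
if ! printf '%s\n' '^1814' | grep -q "$builtin_regex"; then
    echo 'user value "^1814" does NOT match -> no merge'
fi
```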
Re: [ovirt-users] VM has been paused due to storage I/O problem
On Tue, Jan 31, 2017 at 3:23 PM, Nathanaël Blanchet wrote:
> exactly the same issue here with an FC EMC storage domain...
>

I'm trying to mitigate by inserting a timeout for my SAN devices, but I'm not sure of its effectiveness, as the CentOS 7 behavior of "multipathd -k" and then "show config" seems different from CentOS 6.x. In fact my attempt for multipath.conf is this:

# VDSM REVISION 1.3
# VDSM PRIVATE

defaults {
    polling_interval        5
    no_path_retry           fail
    user_friendly_names     no
    flush_on_last_del       yes
    fast_io_fail_tmo        5
    dev_loss_tmo            30
    max_fds                 4096
}

# Remove devices entries when overrides section is available.
devices {
    device {
        # These settings override built-in device settings. They do not apply
        # to devices without built-in settings (these use the settings in the
        # "defaults" section), or to devices defined in the "devices" section.
        # Note: This is not available yet on Fedora 21. For more info see
        # https://bugzilla.redhat.com/1253799
        all_devs            yes
        no_path_retry       fail
    }
    device {
        vendor              "IBM"
        product             "^1814"
        product_blacklist   "Universal Xport"
        path_grouping_policy "group_by_prio"
        path_checker        "rdac"
        features            "0"
        hardware_handler    "1 rdac"
        prio                "rdac"
        failback            immediate
        rr_weight           "uniform"
        no_path_retry       "12"
    }
}

So I put exactly the default device config for my IBM/1814 device, but with no_path_retry set to 12. In CentOS 6.x, when you do something like this, "show config" gives you the modified entry only for your device section. Instead, in CentOS 7.3 it seems I get the default one for IBM/1814 anyway, and also the customized one at the end of the output.

Two facts:
- Before, I could reproduce the problem if I selected Maintenance, then Power Mgmt --> Restart (tried 3 times, with the same behavior each time).
- Instead, if I executed it in separate steps (Maintenance, Power Mgmt --> Stop, wait a moment, Power Mgmt --> Start) I didn't get problems (tried only once...).

With this "new" multipath config (to be confirmed if in effect, how?)
I don't get the VM paused problem even with the Restart option of Power Mgmt. In the active host's messages I see these when the other host reboots:

Jan 31 16:50:01 ovmsrv06 systemd: Started Session 705 of user root.
Jan 31 16:50:01 ovmsrv06 systemd: Starting Session 705 of user root.
Jan 31 16:53:47 ovmsrv06 multipathd: 3600a0b8000299aa8d08955014098: sde - rdac checker reports path is up
Jan 31 16:53:47 ovmsrv06 multipathd: 8:64: reinstated
Jan 31 16:53:47 ovmsrv06 multipathd: 3600a0b8000299aa8d08955014098: load table [0 41943040 multipath 1 queue_if_no_path 1 rdac 2 1 service-time 0 2 1 8:224 1 65:0 1 service-time 0 2 1 8:64 1 8:160 1]
Jan 31 16:53:47 ovmsrv06 multipathd: 3600a0b8000299aa8d08955014098: sdo - rdac checker reports path is ghost
Jan 31 16:53:47 ovmsrv06 multipathd: 8:224: reinstated
Jan 31 16:53:47 ovmsrv06 multipathd: 3600a0b8000299aa8d08955014098: sdk - rdac checker reports path is up
Jan 31 16:53:47 ovmsrv06 multipathd: 8:160: reinstated
Jan 31 16:53:47 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_DS4700, ctlr 1, queueing MODE_SELECT command
Jan 31 16:53:47 ovmsrv06 multipathd: 3600a0b8000299aa8d08955014098: sdq - rdac checker reports path is ghost
Jan 31 16:53:47 ovmsrv06 multipathd: 65:0: reinstated
Jan 31 16:53:48 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_DS4700, ctlr 1, MODE_SELECT returned with sense 05/91/36
Jan 31 16:53:48 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_DS4700, ctlr 1, queueing MODE_SELECT command
Jan 31 16:53:49 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_DS4700, ctlr 1, MODE_SELECT returned with sense 05/91/36
Jan 31 16:53:49 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_DS4700, ctlr 1, queueing MODE_SELECT command
Jan 31 16:53:49 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_DS4700, ctlr 1, MODE_SELECT completed
Jan 31 16:53:49 ovmsrv06 kernel: sd 2:0:1:4: rdac: array Z1_DS4700, ctlr 1, queueing MODE_SELECT command
Jan 31 16:53:49 ovmsrv06 kernel: sd 2:0:1:4: rdac: array Z1_DS4700, ctlr 1, MODE_SELECT completed
Jan 31 16:53:52 ovmsrv06 multipathd: 3600a0b8000299aa8d08955014098: sde - rdac checker reports path is ghost
Jan 31 16:53:52 ovmsrv06 multipathd: 8:64: reinstated
Jan 31 16:53:52 ovmsrv06 multipathd: 3600a0b8000299aa8d08955014098: load table [0 41943040 multipath 1 queue_if_no_path 1 rdac 2 1 service-time 0 2 1 8:224 1 65:0 1 service-time 0 2 1 8:64 1 8:160 1]
Jan 31 16:53:52 ovmsrv06 multipathd: 3600a0b8000299aa8d08955014098: sdo - rdac checker reports path is up
Jan 31 16:53:52 ovmsrv06 multipathd: 8:224: reinstated
Jan 31 16:53:52 ovmsrv06 multipathd: 3600a0b8000299aa8d08955014098: sdk - rdac checker reports path is ghost
Jan 31 16:53:52 ov
Re: [ovirt-users] VM has been paused due to storage I/O problem
exactly the same issue here with an FC EMC storage domain...

On 31/01/2017 at 15:20, Gianluca Cecchi wrote:

Hello,
my test environment is composed of two old HP blades, BL685c G1 (ovmsrv05 and ovmsrv06), connected through a SAN with FC switches to an old IBM DS4700 storage array. Apart from being old, they all seem OK from a hardware point of view.
I have configured oVirt 4.0.6 and an FCP storage domain. The hosts are plain CentOS 7.3 servers, fully updated. It is not a hosted-engine environment: the manager is a VM outside of the cluster. I have configured power management on both hosts and it works well.
At the moment I have only one test VM and it is doing almost nothing.
Starting point: ovmsrv05 has been in maintenance (for about 2 days) and the VM is running on ovmsrv06.
I update the qemu-kvm package on ovmsrv05 and then restart it from the web admin GUI:
Power Mgmt --> Restart
Sequence of events in the events pane, including the problem in the subject line:

Jan 31, 2017 10:29:43 AM Host ovmsrv05 power management was verified successfully.
Jan 31, 2017 10:29:43 AM Status of host ovmsrv05 was set to Up.
Jan 31, 2017 10:29:38 AM Executing power management status on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Jan 31, 2017 10:29:29 AM Activation of host ovmsrv05 initiated by admin@internal-authz.
Jan 31, 2017 10:28:05 AM VM ol65 has recovered from paused back to up.
Jan 31, 2017 10:27:55 AM VM ol65 has been paused due to storage I/O problem.
Jan 31, 2017 10:27:55 AM VM ol65 has been paused.
Jan 31, 2017 10:25:52 AM Host ovmsrv05 was restarted by admin@internal-authz.
Jan 31, 2017 10:25:52 AM Host ovmsrv05 was started by admin@internal-authz.
Jan 31, 2017 10:25:52 AM Power management start of Host ovmsrv05 succeeded.
Jan 31, 2017 10:25:50 AM Executing power management status on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Jan 31, 2017 10:25:37 AM Executing power management start on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Jan 31, 2017 10:25:37 AM Power management start of Host ovmsrv05 initiated.
Jan 31, 2017 10:25:37 AM Auto fence for host ovmsrv05 was started.
Jan 31, 2017 10:25:37 AM All VMs' status on Non Responsive Host ovmsrv05 were changed to 'Down' by admin@internal-authz
Jan 31, 2017 10:25:36 AM Host ovmsrv05 was stopped by admin@internal-authz.
Jan 31, 2017 10:25:36 AM Power management stop of Host ovmsrv05 succeeded.
Jan 31, 2017 10:25:34 AM Executing power management status on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Jan 31, 2017 10:25:15 AM Executing power management stop on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Jan 31, 2017 10:25:15 AM Power management stop of Host ovmsrv05 initiated.
Jan 31, 2017 10:25:12 AM Executing power management status on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.

Watching the timestamps, the culprit seems to be the reboot of ovmsrv05, which detects some LUNs in owned state and others in unowned state.
Full messages of both hosts here:
https://drive.google.com/file/d/0BwoPbcrMv8mvekZQT1pjc0NMRlU/view?usp=sharing
and
https://drive.google.com/file/d/0BwoPbcrMv8mvcjBCYVdFZWdXTms/view?usp=sharing
At this time there are 4 LUNs globally seen by the two hosts, but only 1 of them is currently configured as the only storage domain in the oVirt cluster.
[root@ovmsrv05 ~]# multipath -l | grep ^36
3600a0b8000299aa8d08b55014119 dm-5 IBM ,1814 FAStT
3600a0b8000299902cd3c5501458f dm-3 IBM ,1814 FAStT
3600a0b8000299902ccf855011198 dm-2 IBM ,1814 FAStT
3600a0b8000299aa8d08955014098 dm-4 IBM ,1814 FAStT

The configured one:

[root@ovmsrv05 ~]# multipath -l 3600a0b8000299aa8d08b55014119
3600a0b8000299aa8d08b55014119 dm-5 IBM ,1814 FAStT
size=4.0T features='0' hwhandler='1 rdac' wp=rw
|-+- policy='service-time 0' prio=0 status=active
| |- 0:0:1:3 sdl 8:176 active undef running
| `- 2:0:1:3 sdp 8:240 active undef running
`-+- policy='service-time 0' prio=0 status=enabled
  |- 0:0:0:3 sdd 8:48  active undef running
  `- 2:0:0:3 sdi 8:128 active undef running

In the messages of the booting node, around the time of the problem registered by the storage:

[root@ovmsrv05 ~]# grep owned /var/log/messages
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:1: rdac: LUN 1 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:2: rdac: LUN 2 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:3: rdac: LUN 3 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:1: rdac: LUN 1 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:4: rdac: LUN 4 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:2: rdac: LUN 2 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:1:1: rdac: LUN 1 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:3: rdac: LUN 3 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:4: rdac: LUN 4 (RDAC) (unowned)
Jan 31 10:27:38
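The `grep owned` output above can be summarized mechanically. A small sketch, with two sample log lines inlined so it runs anywhere (on the real host you would pipe `grep owned /var/log/messages` instead):

```shell
# Count rdac path-ownership states from kernel log lines of the form:
#   ... rdac: LUN 3 (RDAC) (unowned)
# Two sample lines are inlined for illustration.
log='Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:3: rdac: LUN 3 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:1: rdac: LUN 1 (RDAC) (owned)'

# "(owned)" never occurs inside "(unowned)", so the two counts do not overlap.
owned=$(printf '%s\n' "$log" | grep -c '(owned)')
unowned=$(printf '%s\n' "$log" | grep -c '(unowned)')
echo "owned paths: $owned, unowned paths: $unowned"
```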
[ovirt-users] VM has been paused due to storage I/O problem
Hello,
my test environment is composed of two old HP blades, BL685c G1 (ovmsrv05 and ovmsrv06), connected through a SAN with FC switches to an old IBM DS4700 storage array. Apart from being old, they all seem OK from a hardware point of view.
I have configured oVirt 4.0.6 and an FCP storage domain. The hosts are plain CentOS 7.3 servers, fully updated. It is not a hosted-engine environment: the manager is a VM outside of the cluster. I have configured power management on both hosts and it works well.
At the moment I have only one test VM and it is doing almost nothing.
Starting point: ovmsrv05 has been in maintenance (for about 2 days) and the VM is running on ovmsrv06.
I update the qemu-kvm package on ovmsrv05 and then restart it from the web admin GUI:
Power Mgmt --> Restart
Sequence of events in the events pane, including the problem in the subject line:

Jan 31, 2017 10:29:43 AM Host ovmsrv05 power management was verified successfully.
Jan 31, 2017 10:29:43 AM Status of host ovmsrv05 was set to Up.
Jan 31, 2017 10:29:38 AM Executing power management status on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Jan 31, 2017 10:29:29 AM Activation of host ovmsrv05 initiated by admin@internal-authz.
Jan 31, 2017 10:28:05 AM VM ol65 has recovered from paused back to up.
Jan 31, 2017 10:27:55 AM VM ol65 has been paused due to storage I/O problem.
Jan 31, 2017 10:27:55 AM VM ol65 has been paused.
Jan 31, 2017 10:25:52 AM Host ovmsrv05 was restarted by admin@internal-authz.
Jan 31, 2017 10:25:52 AM Host ovmsrv05 was started by admin@internal-authz.
Jan 31, 2017 10:25:52 AM Power management start of Host ovmsrv05 succeeded.
Jan 31, 2017 10:25:50 AM Executing power management status on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Jan 31, 2017 10:25:37 AM Executing power management start on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Jan 31, 2017 10:25:37 AM Power management start of Host ovmsrv05 initiated.
Jan 31, 2017 10:25:37 AM Auto fence for host ovmsrv05 was started.
Jan 31, 2017 10:25:37 AM All VMs' status on Non Responsive Host ovmsrv05 were changed to 'Down' by admin@internal-authz
Jan 31, 2017 10:25:36 AM Host ovmsrv05 was stopped by admin@internal-authz.
Jan 31, 2017 10:25:36 AM Power management stop of Host ovmsrv05 succeeded.
Jan 31, 2017 10:25:34 AM Executing power management status on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Jan 31, 2017 10:25:15 AM Executing power management stop on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Jan 31, 2017 10:25:15 AM Power management stop of Host ovmsrv05 initiated.
Jan 31, 2017 10:25:12 AM Executing power management status on Host ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.

Watching the timestamps, the culprit seems to be the reboot of ovmsrv05, which detects some LUNs in owned state and others in unowned state.
Full messages of both hosts here:
https://drive.google.com/file/d/0BwoPbcrMv8mvekZQT1pjc0NMRlU/view?usp=sharing
and
https://drive.google.com/file/d/0BwoPbcrMv8mvcjBCYVdFZWdXTms/view?usp=sharing
At this time there are 4 LUNs globally seen by the two hosts, but only 1 of them is currently configured as the only storage domain in the oVirt cluster.
[root@ovmsrv05 ~]# multipath -l | grep ^36
3600a0b8000299aa8d08b55014119 dm-5 IBM ,1814 FAStT
3600a0b8000299902cd3c5501458f dm-3 IBM ,1814 FAStT
3600a0b8000299902ccf855011198 dm-2 IBM ,1814 FAStT
3600a0b8000299aa8d08955014098 dm-4 IBM ,1814 FAStT

The configured one:

[root@ovmsrv05 ~]# multipath -l 3600a0b8000299aa8d08b55014119
3600a0b8000299aa8d08b55014119 dm-5 IBM ,1814 FAStT
size=4.0T features='0' hwhandler='1 rdac' wp=rw
|-+- policy='service-time 0' prio=0 status=active
| |- 0:0:1:3 sdl 8:176 active undef running
| `- 2:0:1:3 sdp 8:240 active undef running
`-+- policy='service-time 0' prio=0 status=enabled
  |- 0:0:0:3 sdd 8:48  active undef running
  `- 2:0:0:3 sdi 8:128 active undef running

In the messages of the booting node, around the time of the problem registered by the storage:

[root@ovmsrv05 ~]# grep owned /var/log/messages
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:1: rdac: LUN 1 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:2: rdac: LUN 2 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:3: rdac: LUN 3 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:1: rdac: LUN 1 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:4: rdac: LUN 4 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:2: rdac: LUN 2 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:1:1: rdac: LUN 1 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:3: rdac: LUN 3 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:4: rdac: LUN 4 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:1:2: rdac: LUN 2 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:1:1: rdac: LUN 1 (RDAC) (unowned)