Re: [ovirt-users] VM has been paused due to storage I/O problem

2017-04-04 Thread Gianluca Cecchi
On Thu, Feb 2, 2017 at 10:48 AM, Nir Soffer  wrote:

>
> >
> > After about 7 rounds I get this in messages of the host where the VM is
> > running:
> >
> > Feb  1 23:31:39 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a
> partition!
> > Feb  1 23:31:39 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a
> partition!
> > Feb  1 23:31:44 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a
> partition!
> > Feb  1 23:31:45 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a
> partition!
> > Feb  1 23:31:45 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a
> partition!
> > Feb  1 23:31:47 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a
> partition!
> > Feb  1 23:31:50 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a
> partition!
> > Feb  1 23:31:50 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a
> partition!
> > Feb  1 23:31:56 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a
> partition!
> > Feb  1 23:31:57 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a
> partition!
>
> This is interesting, we have seen these messages before, but could never
> detect the flow causing them. Are you sure you see this each time you
> extend your disk?
>
> If you can reproduce this, please file a bug.
>


I got the same on 4.1.1 while testing on iSCSI, so I opened a bugzilla
now... let's see if we can find anything useful:
https://bugzilla.redhat.com/show_bug.cgi?id=1438809

Gianluca
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] VM has been paused due to storage I/O problem

2017-02-03 Thread Benjamin Marzinski
On Fri, Feb 03, 2017 at 12:31:49AM +0100, Gianluca Cecchi wrote:
>On Thu, Feb 2, 2017 at 10:53 PM, Benjamin Marzinski
><[1]bmarz...@redhat.com> wrote:
> 
>  > > I'm trying to mitigate inserting a timeout for my SAN devices but
>  I'm not
>  > > sure of its effectiveness as CentOS 7 behavior  of "multipathd -k"
>  and then
>  > > "show config" seems different from CentOS 6.x
>  > > In fact my attempt for multipath.conf is this
> 
>  There was a significant change in how multipath deals with merging
>  device configurations between RHEL6 and RHEL7.  The short answer is, as
>  long as you copy the entire existing configuration, and just change what
>  you want changed (like you did), you can ignore the change.  Also,
>  multipath doesn't care if you quote numbers.
> 
>  If you want to verify that no_path_retry is being set as intended, you
>  can run:
> 
>  # multipath -r -v3 | grep no_path_retry
> 
>Hi Benjamin,
>thank you very much for the explanations, especially the long one ;-)
>I tried and confirmed that I have no_path_retry = 4 as expected
>The regex matching is only for merge, correct?

No. Both RHEL6 and RHEL7 use regex matching to determine which
device configuration to use with your device, otherwise

product "^1814"

would never match any device, since there is no array with a literal
product string of "^1814". RHEL7 also uses the same regex matching to
determine which builtin device configuration a user-supplied device
configuration should modify. RHEL6 uses string matching for this. 

>So in your example if in RH EL 7 I put this
>        device {
>                vendor "IBM"
>                product "^1814"
>                no_path_retry 12
>        }
>It would not match for merging, but it would match for applying to my
>device (because it is put at the end of config read backwards).

correct.  The confusing point is that in the merging case, "^1814" in
the user-supplied configuration is being treated as a string that needs
to regex match the regular expression "^1814" in the builtin
configuration. These don't match. For matching the device configuration
to the device, "^1814" in the user-supplied configuration is being
treated as a regular expression that needs to regex match the actual
product string of the device.
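
As a quick illustration of the same point with grep (a minimal sketch; grep's
extended regexes stand in for multipath's internal matching, this is not a
command multipath itself runs):

# the literal string "^1814" is NOT matched by the regex ^1814,
# because it starts with '^' rather than '1'
echo '^1814' | grep -E '^1814'            # no output
# a real product string IS matched by the regex ^1814
echo '1814      FAStT' | grep -E '^1814'  # prints the line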

>And it would apply only the no_path_retry setting, while all other ones
>would not be picked from builtin configuration for device, but from
>defaults in general.
>So for example it would set path_checker not this way:
>path_checker "rdac"
>but this way:
>path_checker "directio"
>that is default..
>correct?

exactly.

-Ben
 
> References
> 
>Visible links
>1. mailto:bmarz...@redhat.com
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] VM has been paused due to storage I/O problem

2017-02-02 Thread Gianluca Cecchi
On Thu, Feb 2, 2017 at 10:53 PM, Benjamin Marzinski 
wrote:

>
> > > I'm trying to mitigate inserting a timeout for my SAN devices but I'm
> not
> > > sure of its effectiveness as CentOS 7 behavior  of "multipathd -k" and
> then
> > > "show config" seems different from CentOS 6.x
> > > In fact my attempt for multipath.conf is this
>
> There was a significant change in how multipath deals with merging
> device configurations between RHEL6 and RHEL7.  The short answer is, as
> long as you copy the entire existing configuration, and just change what
> you want changed (like you did), you can ignore the change.  Also,
> multipath doesn't care if you quote numbers.
>
> If you want to verify that no_path_retry is being set as intended, you
> can run:
>
> # multipath -r -v3 | grep no_path_retry
>

Hi Benjamin,
thank you very much for the explanations, especially the long one ;-)
I tried and confirmed that I have no_path_retry = 4 as expected

The regex matching is only for merge, correct?
So in your example if in RH EL 7 I put this

device {
vendor "IBM"
product "^1814"
no_path_retry 12
}

It would not match for merging, but it would match for applying to my
device (because it is put at the end of config read backwards).
And it would apply only the no_path_retry setting, while all other ones
would not be picked from builtin configuration for device, but from
defaults in general.
So for example it would set path_checker not this way:
path_checker "rdac"

but this way:
path_checker "directio"
that is default..

correct?
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] VM has been paused due to storage I/O problem

2017-02-02 Thread Benjamin Marzinski
On Wed, Feb 01, 2017 at 09:39:45AM +0200, Nir Soffer wrote:
> On Tue, Jan 31, 2017 at 6:09 PM, Gianluca Cecchi
>  wrote:
> > On Tue, Jan 31, 2017 at 3:23 PM, Nathanaël Blanchet 
> > wrote:
> >>
> >> exactly the same issue by there with FC EMC domain storage...
> >>
> >>
> >
> > I'm trying to mitigate inserting a timeout for my SAN devices but I'm not
> > sure of its effectiveness as CentOS 7 behavior  of "multipathd -k" and then
> > "show config" seems different from CentOS 6.x
> > In fact my attempt for multipath.conf is this

There was a significant change in how multipath deals with merging
device configurations between RHEL6 and RHEL7.  The short answer is, as
long as you copy the entire existing configuration, and just change what
you want changed (like you did), you can ignore the change.  Also,
multipath doesn't care if you quote numbers.

If you want to verify that no_path_retry is being set as intended, you
can run:

# multipath -r -v3 | grep no_path_retry

This reloads your multipath devices with verbosity turned up. You should see
lines like:

Feb 02 09:38:30 | mpatha: no_path_retry = 12 (controller setting)

That will tell you what no_path_retry is set to.

The configuration Nir suggested at the end of this email looks good to
me.

Now, here's the long answer:

multipath allows you to merge device configurations.  This means that as
long as you put in the "vendor" and "product" strings, you only need to
set the other values that you care about. On RHEL6, this would work

device {
vendor "IBM"
product "^1814"
no_path_retry 12
}

And it would create a configuration that was exactly the same as the
builtin config for this device, except that no_path_retry was set to 12.
However, this wasn't as easy for users as it was supposed to be.
Specifically, users would often add their device's vendor and product
information, as well as whatever they wanted changed, and then be
surprised when multipath didn't retain all the information from the
builtin configuration as advertised. This is because they used the
actual vendor and product strings for their device, but the builtin
device configuration's vendor and product strings were regexes. In
RHEL6, multipath only merged configurations if the vendor and product
strings matched as literal strings. So users would try

device {
vendor "IBM"
product "1814 FASt"
no_path_retry 12
}

and it wouldn't work as expected, since the product strings didn't
match.  To fix this, when RHEL7 checks if a user configuration should be
merged with a builtin configuration, all that is required is that the
user configuration's vendor and product strings regex match the builtin.
This means that the above configuration will work as expected in RHEL7.
However, the first configuration won't, because "^1814" doesn't regex
match "^1814".  This means that multipath would treat it as a completely
new configuration, and not merge any values from the builtin
configuration.  You can re-enable the RHEL6 behaviour in RHEL7 by setting

hw_str_match yes

in the defaults section.
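
That is, a minimal sketch of the defaults entry named above:

defaults {
        hw_str_match yes
}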

Now, because the builtin configurations could handle more than one
device type per configuration, since they used regexes to match the
vendor and product strings, multipath couldn't just remove the original
builtin configuration when users added a new configuration that modified
it.  Otherwise, devices that regex matched the builtin configuration's
vendor and product strings but not the user configuration's vendor and
product strings wouldn't have any device configuration information. So
multipath keeps the original builtin configuration as well as the new
one.  However, when it's time to assign a device configuration to a
device, multipath looks through the device configurations list
backwards, and finds the first match.  This means that it will always
use the user configuration instead of the builtin one (since new
configurations get added to the end of the list).
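
One hedged way to see both entries on a host is to dump the merged
configuration (the "multipathd show config" command used elsewhere in this
thread) and look for the vendor/product pair; the user-supplied entry should
appear after the builtin one:

multipathd show config | grep -B 1 -A 1 1814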

Like I said before, if you add all the values you want set in your
configuration, instead of relying on them being merged from the builtin
configuration, then you don't need to worry about any of this.

-Ben

> >
> > # VDSM REVISION 1.3
> > # VDSM PRIVATE
> >
> > defaults {
> >     polling_interval    5
> >     no_path_retry       fail
> >     user_friendly_names no
> >     flush_on_last_del   yes
> >     fast_io_fail_tmo    5
> >     dev_loss_tmo        30
> >     max_fds             4096
> > }
> >
> > # Remove devices entries when overrides section is available.
> > devices {
> > device {
> > # These settings overrides built-in devices settings. It does not
> > apply
> > # to devices without built-in settings (these use the settings in
> > the
> > # "defaults" section), or to devices defined in the "devices"
> > section.

Re: [ovirt-users] VM has been paused due to storage I/O problem

2017-02-02 Thread Nir Soffer
On Thu, Feb 2, 2017 at 6:05 PM, Gianluca Cecchi
 wrote:
> On Thu, Feb 2, 2017 at 3:51 PM, Nir Soffer  wrote:
>>
>>
>> > Can you confirm that the host can be active when I restart vdsmd
>> > service?
>>
>> Sure. This may abort a storage operation if one is running when you
>> restart
>> vdsm, but vdsm is designed so you can restart or kill it safely.
>>
>> For example, if you abort a disk copy in the middle, the operation will
>> fail
>> and the destination disk will be deleted.
>>
>> If you want to avoid such issue, you can put a host to maintenance, but
>> this
>> requires migration of vms to other hosts.
>>
>> Nir
>
>
> OK. Created 50_thin_block_extension_rules.conf under /etc/vdsm/vdsm.conf.d
> and restarted vdsmd
>
> One last (latest probably... ;-) question
> Is it expected that if I restart vdsmd on the host that is the SPM, then SPM
> is shifted to another node?

Yes, engine will move the SPM to another host when the SPM fails, unless you
have disabled the SPM role on all other hosts (see host > spm tab).

> Because when restarting vdsmd on the host that is not SPM I didn't get any
> message in web admin gui and restart of vdsmd itself was very fast.
> Instead on the host with SPM, the command took several seconds and I got
> these events

It is expected that restarting the SPM host is slower, but we need to see vdsm logs
to understand why.

> Feb 2, 2017 4:01:23 PM Host ovmsrv05 power management was verified
> successfully.
> Feb 2, 2017 4:01:23 PM Status of host ovmsrv05 was set to Up.
> Feb 2, 2017 4:01:19 PM Executing power management status on Host ovmsrv05
> using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
> Feb 2, 2017 4:01:18 PM Storage Pool Manager runs on Host ovmsrv06 (Address:
> ovmsrv06.datacenter.polimi.it).
> Feb 2, 2017 4:01:13 PM VDSM ovmsrv05 command failed: Recovering from crash
> or Initializing
> Feb 2, 2017 4:01:11 PM Host ovmsrv05 is initializing. Message: Recovering
> from crash or Initializing
> Feb 2, 2017 4:01:11 PM VDSM ovmsrv05 command failed: Recovering from crash
> or Initializing
> Feb 2, 2017 4:01:11 PM Invalid status on Data Center Default. Setting Data
> Center status to Non Responsive (On host ovmsrv05, Error: Recovering from
> crash or Initializing).
> Feb 2, 2017 4:01:11 PM VDSM ovmsrv05 command failed: Recovering from crash
> or Initializing
> Feb 2, 2017 4:01:05 PM Host ovmsrv05 is not responding. It will stay in
> Connecting state for a grace period of 80 seconds and after that an attempt
> to fence the host will be issued.
> Feb 2, 2017 4:01:05 PM Host ovmsrv05 is not responding. It will stay in
> Connecting state for a grace period of 80 seconds and after that an attempt
> to fence the host will be issued.
> Feb 2, 2017 4:01:05 PM VDSM ovmsrv05 command failed: Connection reset by
> peer

It looks like the engine discovered that the SPM was down, and reconnected.

It is expected that changes in the SPM status are detected early and that engine
tries to recover the SPM; the SPM role is critical in oVirt.

Are you sure you did not get any message when restarting the other host?
I would expect that engine detects and reports a restart of any host.

If you can reproduce this (restarting vdsm is not detected by engine and
not reported in the engine event log), please file a bug.

Nir
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] VM has been paused due to storage I/O problem

2017-02-02 Thread Gianluca Cecchi
On Thu, Feb 2, 2017 at 3:51 PM, Nir Soffer  wrote:

>
> > Can you confirm that the host can be active when I restart vdsmd service?
>
> Sure. This may abort a storage operation if one is running when you restart
> vdsm, but vdsm is designed so you can restart or kill it safely.
>
> For example, if you abort a disk copy in the middle, the operation will
> fail
> and the destination disk will be deleted.
>
> If you want to avoid such issue, you can put a host to maintenance, but
> this
> requires migration of vms to other hosts.
>
> Nir
>

OK. Created 50_thin_block_extension_rules.conf under /etc/vdsm/vdsm.conf.d
and restarted vdsmd
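
The contents of that drop-in are not shown here; based on the [irs] options
discussed earlier in the thread it presumably looks something like this (the
values below are just the ones from Nir's example, not necessarily the ones
actually used):

# /etc/vdsm/vdsm.conf.d/50_thin_block_extension_rules.conf
[irs]
volume_utilization_percent = 25
volume_utilization_chunk_mb = 2048

followed by:

systemctl restart vdsmd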

One last (latest probably... ;-) question
Is it expected that if I restart vdsmd on the host that is the SPM, then
SPM is shifted to another node?
Because when restarting vdsmd on the host that is not SPM I didn't get any
message in web admin gui and restart of vdsmd itself was very fast.
Instead on the host with SPM, the command took several seconds and I got
these events

Feb 2, 2017 4:01:23 PM Host ovmsrv05 power management was verified
successfully.
Feb 2, 2017 4:01:23 PM Status of host ovmsrv05 was set to Up.
Feb 2, 2017 4:01:19 PM Executing power management status on Host ovmsrv05
using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Feb 2, 2017 4:01:18 PM Storage Pool Manager runs on Host ovmsrv06 (Address:
ovmsrv06.datacenter.polimi.it).
Feb 2, 2017 4:01:13 PM VDSM ovmsrv05 command failed: Recovering from crash
or Initializing
Feb 2, 2017 4:01:11 PM Host ovmsrv05 is initializing. Message: Recovering
from crash or Initializing
Feb 2, 2017 4:01:11 PM VDSM ovmsrv05 command failed: Recovering from crash
or Initializing
Feb 2, 2017 4:01:11 PM Invalid status on Data Center Default. Setting Data
Center status to Non Responsive (On host ovmsrv05, Error: Recovering from
crash or Initializing).
Feb 2, 2017 4:01:11 PM VDSM ovmsrv05 command failed: Recovering from crash
or Initializing
Feb 2, 2017 4:01:05 PM Host ovmsrv05 is not responding. It will stay in
Connecting state for a grace period of 80 seconds and after that an attempt
to fence the host will be issued.
Feb 2, 2017 4:01:05 PM Host ovmsrv05 is not responding. It will stay in
Connecting state for a grace period of 80 seconds and after that an attempt
to fence the host will be issued.
Feb 2, 2017 4:01:05 PM VDSM ovmsrv05 command failed: Connection reset by
peer

Thanks
Gianluca
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] VM has been paused due to storage I/O problem

2017-02-02 Thread Nir Soffer
On Thu, Feb 2, 2017 at 4:42 PM, Gianluca Cecchi
 wrote:
> On Thu, Feb 2, 2017 at 3:30 PM, Nir Soffer  wrote:
>>
>>
>> If you don't have a vdsm.conf file, or the file is empty, you can
>> generate a new file
>> like this:
>>
>> python /usr/lib64/python2.7/site-packages/vdsm/config.py >
>> vdsm.conf.example
>
>
> thanks.
> It seems that the package python-libs-2.7.5-48.el7.x86_64 actually uses
> /usr/lib path instead of /usr/lib64...
> What worked
> python /usr/lib/python2.7/site-packages/vdsm/config.py > vdsm.conf.example

Indeed, my mistake, we moved to /usr/lib several versions ago.
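
A path-agnostic way to locate it (assuming the vdsm python package is importable
on the host) could be:

python -c 'import vdsm.config as c; print(c.__file__)'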

>
>>
>> >
>> >
>> >>
>> >>
>> >> > What I should do after changing them to make them active?
>> >>
>> >> Restart vdsm
>> >
>> >
>> > with host into maintenance mode or could it be active?
>> >
>> > Thanks
>> >
>
>
> Can you confirm that the host can be active when I restart vdsmd service?

Sure. This may abort a storage operation if one is running when you restart
vdsm, but vdsm is designed so you can restart or kill it safely.

For example, if you abort a disk copy in the middle, the operation will fail
and the destination disk will be deleted.

If you want to avoid such issue, you can put a host to maintenance, but this
requires migration of vms to other hosts.

Nir
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] VM has been paused due to storage I/O problem

2017-02-02 Thread Gianluca Cecchi
On Thu, Feb 2, 2017 at 3:30 PM, Nir Soffer  wrote:

>
> If you don't have a vdsm.conf file, or the file is empty, you can
> generate a new file
> like this:
>
> python /usr/lib64/python2.7/site-packages/vdsm/config.py >
> vdsm.conf.example
>

thanks.
It seems that the package python-libs-2.7.5-48.el7.x86_64 actually
uses the /usr/lib path instead of /usr/lib64...
What worked was:
python /usr/lib/python2.7/site-packages/vdsm/config.py > vdsm.conf.example
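
As a quick check of the defaults discussed in this thread (assuming the
generated example lists the [irs] options with their default values):

grep -E 'volume_utilization' vdsm.conf.example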


> >
> >
> >>
> >>
> >> > What I should do after changing them to make them active?
> >>
> >> Restart vdsm
> >
> >
> > with host into maintenance mode or could it be active?
> >
> > Thanks
> >
>

Can you confirm that the host can be active when I restart vdsmd service?

Thanks
Gianluca
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] VM has been paused due to storage I/O problem

2017-02-02 Thread Nir Soffer
On Thu, Feb 2, 2017 at 4:16 PM, Gianluca Cecchi
 wrote:
> On Thu, Feb 2, 2017 at 12:48 PM, Nir Soffer  wrote:
>>
>>
>> >
>> > What about [irs] values? Where are they located, in vdsm.conf?
>>
>> Yes but you should not modify them in vdsm.conf.
>>
>> > What are defaults for volume_utilization_percent and
>> > volume_utilization_chunk_mb?
>> > Did they change from 3.6 to 4.0 to 4.1?
>>
>> No, the defaults did not change in the last 3.5 years.
>
>
> OK.
> In a previous message of this thread you wrote that the default
> volume_utilization_chunk_mb is 1024
> What about default for volume_utilization_percent? 50?

Yes, this should be a commented out option in /etc/vdsm/vdsm.conf.

If you don't have a vdsm.conf file, or the file is empty, you can
generate a new file
like this:

python /usr/lib64/python2.7/site-packages/vdsm/config.py > vdsm.conf.example

>
>
>>
>>
>> > What I should do after changing them to make them active?
>>
>> Restart vdsm
>
>
> with host into maintenance mode or could it be active?
>
> Thanks
>
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] VM has been paused due to storage I/O problem

2017-02-02 Thread Gianluca Cecchi
On Thu, Feb 2, 2017 at 12:48 PM, Nir Soffer  wrote:

>
> >
> > What about [irs] values? Where are they located, in vdsm.conf?
>
> Yes but you should not modify them in vdsm.conf.
>
> > What are defaults for volume_utilization_percent and
> > volume_utilization_chunk_mb?
> > Did they change from 3.6 to 4.0 to 4.1?
>
> No, the defaults did not change in the last 3.5 years.
>

OK.
In a previous message of this thread you wrote that the default
volume_utilization_chunk_mb is 1024.
What about the default for volume_utilization_percent? 50?



>
> > What I should do after changing them to make them active?
>
> Restart vdsm
>

with host into maintenance mode or could it be active?

Thanks
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] VM has been paused due to storage I/O problem

2017-02-02 Thread Nir Soffer
On Thu, Feb 2, 2017 at 1:34 PM, Gianluca Cecchi
 wrote:
> On Thu, Feb 2, 2017 at 12:09 PM, Yaniv Kaul  wrote:
>>
>>
>>
>>>
>>>
>>> I decided to switch to preallocated for further tests and confirm
>>> So I created a snapshot and then a clone of the VM, changing allocation
>>> policy of the disk to preallocated.
>>> So far so good.
>>>
>>> Feb 2, 2017 10:40:23 AM VM ol65preallocated creation has been completed.
>>> Feb 2, 2017 10:24:15 AM VM ol65preallocated creation was initiated by
>>> admin@internal-authz.
>>> Feb 2, 2017 10:22:31 AM Snapshot 'for cloning' creation for VM 'ol65' has
>>> been completed.
>>> Feb 2, 2017 10:22:31 AM Snapshot 'for cloning' creation for VM 'ol65' was
>>> initiated by admin@internal-authz.
>>>
>>> so the throughput seems ok based on this storage type (the LUNs are on
>>> RAID5 made with sata disks): 16 minutes to write 90Gb is about 96MBytes/s,
>>> what expected
>>
>>
>> What is your expectation? Is it FC, iSCSI? How many paths? What is the IO
>> scheduler in the VM? Is it using virtio-blk or virtio-SCSI?
>> Y.
>>
>
> Peak bandwith no more than 140 MBytes/s, based on storage capabilities, but
> I don't have to do a rude performance test. I need stability
> Hosts has a mezzanine dual-port HBA (4 Gbit); each HBA connected to a
> different FC-switch and the multipath connection has 2 active paths (one for
> each HBA).
>
> I confirm that with preallocated disk of the cloned VM I don't have indeed
> the previous problems.
> The same loop executed for about 66 times in a 10 minutes interval without
> any problem registered on hosts
> No message at all in /var/log/messages of both hosts.
> My storage domain not compromised
> It remains important the question about thin provisioning and SAN LUNs (aka
> with LVM based disks).
> In my opinion I shouldn't care of the kind of I/O made inside a VM and
> anyway it shouldn't interfere with my storage domain, bringing down
> completely my hosts/VMs.
> In theory there could be an application inside a VM that generates something
> similar to my loop and so would generate problems.
> For sure I can then notify VM responsible about his/her workload, but it
> should not compromise my virtual infrastructure
> I could have an RDBMS inside a VM and a user that creates a big datafile and
> that should imply many extend operations if the disk is thin provisioned
>
> What about [irs] values? Where are they located, in vdsm.conf?

Yes but you should not modify them in vdsm.conf.

> What are defaults for volume_utilization_percent and
> volume_utilization_chunk_mb?
> Did they change from 3.6 to 4.0 to 4.1?

No, the defaults did not change in the last 3.5 years.

In 4.0 we introduced drop-in support, and this is the recommended
way to perform configuration changes.

To change these values, you create a file at

/etc/vdsm/vdsm.conf.d/50_my.conf

The name of the file does not matter; vdsm will read all files in
the vdsm.conf.d directory, sort them by name (this is why you
should use the 50_ prefix), and apply the changes to the configuration.

In this file you put the sections and options you need, like:

[irs]
volume_utilization_percent = 25
volume_utilization_chunk_mb = 2048

> What I should do after changing them to make them active?

Restart vdsm

Using this method, you can provision the same file on all hosts
using standard provisioning tools.
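
For example, a minimal sketch (the host names are the ones from this thread,
and the file name is the one used above):

for h in ovmsrv05 ovmsrv06; do
    scp /etc/vdsm/vdsm.conf.d/50_my.conf root@$h:/etc/vdsm/vdsm.conf.d/
    ssh root@$h systemctl restart vdsmd
done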

It is not recommended to modify /etc/vdsm/vdsm.conf. If you do
this you will have to manually merge changes from vdsm.conf.rpmnew
after upgrading vdsm.

Nir
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] VM has been paused due to storage I/O problem

2017-02-02 Thread Gianluca Cecchi
On Thu, Feb 2, 2017 at 12:09 PM, Yaniv Kaul  wrote:

>
>
>
>>
>> I decided to switch to preallocated for further tests and confirm
>> So I created a snapshot and then a clone of the VM, changing allocation
>> policy of the disk to preallocated.
>> So far so good.
>>
>> Feb 2, 2017 10:40:23 AM VM ol65preallocated creation has been completed.
>> Feb 2, 2017 10:24:15 AM VM ol65preallocated creation was initiated by
>> admin@internal-authz.
>> Feb 2, 2017 10:22:31 AM Snapshot 'for cloning' creation for VM 'ol65' has
>> been completed.
>> Feb 2, 2017 10:22:31 AM Snapshot 'for cloning' creation for VM 'ol65' was
>> initiated by admin@internal-authz.
>>
>> so the throughput seems ok based on this storage type (the LUNs are on
>> RAID5 made with sata disks): 16 minutes to write 90Gb is about 96MBytes/s,
>> what expected
>>
>
> What is your expectation? Is it FC, iSCSI? How many paths? What is the IO
> scheduler in the VM? Is it using virtio-blk or virtio-SCSI?
> Y.
>
>
Peak bandwidth is no more than 140 MBytes/s, based on the storage
capabilities, but I'm not trying to do a crude performance test; I need
stability.
The hosts have a mezzanine dual-port HBA (4 Gbit); each HBA is connected to a
different FC switch and the multipath connection has 2 active paths (one
for each HBA).

I confirm that with the preallocated disk of the cloned VM I don't have
the previous problems.
The same loop executed about 66 times in a 10-minute interval without
any problem registered on the hosts, and no message at all in
/var/log/messages of either host.
My storage domain was not compromised.
The question about thin provisioning and SAN LUNs (i.e. with LVM based
disks) remains important.
In my opinion I shouldn't have to care about the kind of I/O made inside a
VM, and in any case it shouldn't interfere with my storage domain, bringing
down my hosts/VMs completely.
In theory there could be an application inside a VM that generates
something similar to my loop and so would cause problems.
For sure I can then notify the person responsible for the VM about his/her
workload, but it should not compromise my virtual infrastructure.
I could have an RDBMS inside a VM and a user that creates a big datafile,
and that would imply many extend operations if the disk is thin
provisioned.

What about the [irs] values? Where are they located, in vdsm.conf?
What are the defaults for volume_utilization_percent and
volume_utilization_chunk_mb?
Did they change from 3.6 to 4.0 to 4.1?
What should I do after changing them to make them active?

Thanks in advance,
Gianluca
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] VM has been paused due to storage I/O problem

2017-02-02 Thread Nir Soffer
On Thu, Feb 2, 2017 at 12:04 PM, Gianluca Cecchi
 wrote:
>
>
> On Thu, Feb 2, 2017 at 10:48 AM, Nir Soffer  wrote:
>>
>> On Thu, Feb 2, 2017 at 1:11 AM, Gianluca Cecchi
>>  wrote:
>> > On Wed, Feb 1, 2017 at 8:22 PM, Gianluca Cecchi
>> > 
>> > wrote:
>> >>
>> >>
>> >> OK. In the mean time I have applied your suggested config and restarted
>> >> the 2 nodes.
>> >> Let we test and see if I find any problems running also some I/O tests.
>> >> Thanks in the mean time,
>> >> Gianluca
>> >
>> >
>> >
>> > Quick test without much success
>> >
>> > Inside the guest I run this loop
>> > while true
>> > do
>> > time dd if=/dev/zero bs=1024k count=1024 of=/home/g.cecchi/testfile
>> > sleep 5
>> > done
>>
>> I don't think this test is related to the issues you reported earlier.
>>
>
> I thought the same too, and all related comments you wrote.
> I'm going to test the suggested modifications for chunks.
> In general do you recommend thin provisioning at all on SAN storage?

Only if your storage does not support thin provisioning, or you need snapshot
support.

If you don't need these features, using raw will be much more reliable
and faster.

Even if you use raw, you can still perform live storage migration; we
create a snapshot
using qcow2 format, copy the base raw volume to another storage, and
finally delete
the snapshot on the destination storage.

In the future (ovirt 5?) we would like to use only smart storage thin
provisioning
and snapshot support.

> I decided to switch to preallocated for further tests and confirm
> So I created a snapshot and then a clone of the VM, changing allocation
> policy of the disk to preallocated.
> So far so good.
>
> Feb 2, 2017 10:40:23 AM VM ol65preallocated creation has been completed.
> Feb 2, 2017 10:24:15 AM VM ol65preallocated creation was initiated by
> admin@internal-authz.
> Feb 2, 2017 10:22:31 AM Snapshot 'for cloning' creation for VM 'ol65' has
> been completed.
> Feb 2, 2017 10:22:31 AM Snapshot 'for cloning' creation for VM 'ol65' was
> initiated by admin@internal-authz.
>
> so the throughput seems ok based on this storage type (the LUNs are on RAID5
> made with sata disks): 16 minutes to write 90Gb is about 96MBytes/s, what
> expected
>
> What I see in messages during the cloning phasefrom 10:24 to 10:40:
>
> Feb  2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb  2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb  2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb  2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb  2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb  2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb  2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb  2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb  2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb  2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb  2 10:24:14 ovmsrv05 journal: vdsm root WARNING File:
> /rhev/data-center/588237b8-0031-02f6-035d-0136/922b5269-ab56-4c4d-838f-49d33427e2ab/images/9d1c977f-540d-436a-9d93-b1cb0816af2a/607dbf59-7d4d-4fc3-ae5f-e8824bf82648
> already removed
> Feb  2 10:24:14 ovmsrv05 multipathd: dm-15: remove map (uevent)
> Feb  2 10:24:14 ovmsrv05 multipathd: dm-15: devmap not registered, can't
> remove
> Feb  2 10:24:14 ovmsrv05 multipathd: dm-15: remove map (uevent)
> Feb  2 10:24:17 ovmsrv05 kernel: blk_update_request: critical target error,
> dev dm-4, sector 44566529
> Feb  2 10:24:17 ovmsrv05 kernel: dm-15: WRITE SAME failed. Manually zeroing.
> Feb  2 10:40:07 ovmsrv05 kernel: scsi_verify_blk_ioctl: 16 callbacks
> suppressed
> Feb  2 10:40:07 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb  2 10:40:17 ovmsrv05 multipathd: dm-15: remove map (uevent)
> Feb  2 10:40:17 ovmsrv05 multipathd: dm-15: devmap not registered, can't
> remove
> Feb  2 10:40:17 ovmsrv05 multipathd: dm-15: remove map (uevent)
> Feb  2 10:40:22 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!

Let's file a bug to investigate these "kernel: dd: sending ioctl
80306d02 to a partition!" messages.

Please attach the vdsm log from the machine emitting these logs.

>> > After about 7 rounds I get this in messages of the host where the VM is
>> > running:
>> >
>> > Feb  1 23:31:39 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a
>> > partition!
>> > Feb  1 23:31:39 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a
>> > partition!
>> > Feb  1 23:31:44 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a
>> > partition!
>> > Feb  1 23:31:45 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a
>> > partition!
>> > Feb  1 23:31:45 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a
>> > partition!
>> > Feb  1 23:31:47 ovmsrv06 

Re: [ovirt-users] VM has been paused due to storage I/O problem

2017-02-02 Thread Yaniv Kaul
On Thu, Feb 2, 2017 at 12:04 PM, Gianluca Cecchi 
wrote:

>
>
> On Thu, Feb 2, 2017 at 10:48 AM, Nir Soffer  wrote:
>
>> On Thu, Feb 2, 2017 at 1:11 AM, Gianluca Cecchi
>>  wrote:
>> > On Wed, Feb 1, 2017 at 8:22 PM, Gianluca Cecchi <
>> gianluca.cec...@gmail.com>
>> > wrote:
>> >>
>> >>
>> >> OK. In the mean time I have applied your suggested config and restarted
>> >> the 2 nodes.
>> >> Let we test and see if I find any problems running also some I/O tests.
>> >> Thanks in the mean time,
>> >> Gianluca
>> >
>> >
>> >
>> > Quick test without much success
>> >
>> > Inside the guest I run this loop
>> > while true
>> > do
>> > time dd if=/dev/zero bs=1024k count=1024 of=/home/g.cecchi/testfile
>>
>
A single 'dd' rarely saturates high-performance storage.
There are better utilities for testing ('fio', 'vdbench' and 'ddpt', for
example).
It's also testing a very theoretical scenario - you very rarely write zeros,
and you very rarely write that much sequential IO with a fixed block
size. So it's almost 'hero numbers'.
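
For reference, a hedged fio invocation in the same spirit as the dd loop above
(direct IO, sequential writes, 1 MiB blocks; the file name and size are
arbitrary):

fio --name=seqwrite --filename=/home/g.cecchi/testfile --rw=write --bs=1024k \
    --size=1g --direct=1 --numjobs=1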

> sleep 5
>> > done
>>
>> I don't think this test is related to the issues you reported earlier.
>>
>>
> I thought the same too, and all related comments you wrote.
> I'm going to test the suggested modifications for chunks.
> In general do you recommend thin provisioning at all on SAN storage?
>

Depends on your SAN. On a thin provisioned one (with potentially inline dedup
and compression, such as XtremIO, Pure, Nimble and others) I don't see
great value in thin provisioning.


>
> I decided to switch to preallocated for further tests and confirm
> So I created a snapshot and then a clone of the VM, changing allocation
> policy of the disk to preallocated.
> So far so good.
>
> Feb 2, 2017 10:40:23 AM VM ol65preallocated creation has been completed.
> Feb 2, 2017 10:24:15 AM VM ol65preallocated creation was initiated by
> admin@internal-authz.
> Feb 2, 2017 10:22:31 AM Snapshot 'for cloning' creation for VM 'ol65' has
> been completed.
> Feb 2, 2017 10:22:31 AM Snapshot 'for cloning' creation for VM 'ol65' was
> initiated by admin@internal-authz.
>
> so the throughput seems ok based on this storage type (the LUNs are on
> RAID5 made with sata disks): 16 minutes to write 90Gb is about 96MBytes/s,
> what expected
>

What is your expectation? Is it FC, iSCSI? How many paths? What is the IO
scheduler in the VM? Is it using virtio-blk or virtio-SCSI?
Y.



>
> What I see in messages during the cloning phasefrom 10:24 to 10:40:
>
> Feb  2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb  2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb  2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb  2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb  2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb  2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb  2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb  2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb  2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb  2 10:24:13 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb  2 10:24:14 ovmsrv05 journal: vdsm root WARNING File:
> /rhev/data-center/588237b8-0031-02f6-035d-0136/
> 922b5269-ab56-4c4d-838f-49d33427e2ab/images/9d1c977f-
> 540d-436a-9d93-b1cb0816af2a/607dbf59-7d4d-4fc3-ae5f-e8824bf82648 already
> removed
> Feb  2 10:24:14 ovmsrv05 multipathd: dm-15: remove map (uevent)
> Feb  2 10:24:14 ovmsrv05 multipathd: dm-15: devmap not registered, can't
> remove
> Feb  2 10:24:14 ovmsrv05 multipathd: dm-15: remove map (uevent)
> Feb  2 10:24:17 ovmsrv05 kernel: blk_update_request: critical target
> error, dev dm-4, sector 44566529
> Feb  2 10:24:17 ovmsrv05 kernel: dm-15: WRITE SAME failed. Manually
> zeroing.
> Feb  2 10:40:07 ovmsrv05 kernel: scsi_verify_blk_ioctl: 16 callbacks
> suppressed
> Feb  2 10:40:07 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb  2 10:40:17 ovmsrv05 multipathd: dm-15: remove map (uevent)
> Feb  2 10:40:17 ovmsrv05 multipathd: dm-15: devmap not registered, can't
> remove
> Feb  2 10:40:17 ovmsrv05 multipathd: dm-15: remove map (uevent)
> Feb  2 10:40:22 ovmsrv05 kernel: dd: sending ioctl 80306d02 to a partition!
>
>
>
>
>> > After about 7 rounds I get this in messages of the host where the VM is
>> > running:
>> >
>> > Feb  1 23:31:39 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a
>> partition!
>> > Feb  1 23:31:39 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a
>> partition!
>> > Feb  1 23:31:44 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a
>> partition!
>> > Feb  1 23:31:45 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a
>> partition!
>> > Feb  1 23:31:45 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a
>> partition!
>> > Feb  1 23:31:47 

Re: [ovirt-users] VM has been paused due to storage I/O problem

2017-02-02 Thread Nir Soffer
On Thu, Feb 2, 2017 at 1:11 AM, Gianluca Cecchi
 wrote:
> On Wed, Feb 1, 2017 at 8:22 PM, Gianluca Cecchi 
> wrote:
>>
>>
>> OK. In the mean time I have applied your suggested config and restarted
>> the 2 nodes.
>> Let we test and see if I find any problems running also some I/O tests.
>> Thanks in the mean time,
>> Gianluca
>
>
>
> Quick test without much success
>
> Inside the guest I run this loop
> while true
> do
> time dd if=/dev/zero bs=1024k count=1024 of=/home/g.cecchi/testfile
> sleep 5
> done

I don't think this test is related to the issues you reported earlier.

What you test here is how fast ovirt thin provisioning can extend your
disk when writing zeros. We don't handle that very well: each extend
needs a couple of seconds (4-6) to complete before we refresh the lv
on the host running the vm, and this is the *best* case. In the worst
case it can take much longer.

Also what you tested here is how fast you can write to your vm buffer cache,
since you are not using direct io.

A better way to perform this test is:

time dd if=/dev/zero bs=1024k count=1024 of=/home/g.cecchi/testfile oflag=direct

This will give you the time to actually write data to storage.

If you have a real issue with vms pausing during writes when vm disk has to
be extended, you can enlarge the extend chunk, 1GiB by default.

To use chunks of 2GiB, set:

[irs]
volume_utilization_percent = 50
volume_utilization_chunk_mb = 2048

This will extend the drive when free space is less than 1024MiB
(volume_utilization_chunk_mb * (100 - volume_utilization_percent) / 100)

If this is not enough, you can also use a lower volume_utilization_percent;
for example, this will extend the disk in 2GiB chunks when free space
is below 1536MiB:

[irs]
volume_utilization_percent = 25
volume_utilization_chunk_mb = 2048
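
A quick sanity check of the thresholds implied by the formula above:

echo $(( 2048 * (100 - 50) / 100 ))   # 1024 MiB free-space threshold at 50 percent
echo $(( 2048 * (100 - 25) / 100 ))   # 1536 MiB free-space threshold at 25 percent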

> BTW: my home is inside the / filesystem on the guest, which has space to
> accommodate the 1GB of the dd command:
> [g.cecchi@ol65 ~]$ df -h /home/g.cecchi/
> FilesystemSize  Used Avail Use% Mounted on
> /dev/mapper/vg_ol65-lv_root
>20G  4.9G   14G  27% /
> [g.cecchi@ol65 ~]$
>
> After about 7 rounds I get this in messages of the host where the VM is
> running:
>
> Feb  1 23:31:39 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb  1 23:31:39 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb  1 23:31:44 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb  1 23:31:45 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb  1 23:31:45 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb  1 23:31:47 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb  1 23:31:50 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb  1 23:31:50 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb  1 23:31:56 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
> Feb  1 23:31:57 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!

This is interesting, we have seen these messages before, but could never
detect the flow causing them. Are you sure you see this each time you extend
your disk?

If you can reproduce this, please file a bug.

>
> Nothing on the other host.
> In web admin events pane:
> Feb 1, 2017 11:31:44 PM VM ol65 has been paused due to no Storage space
> error.
> Feb 1, 2017 11:31:44 PM VM ol65 has been paused.
>
> I stop the dd loop and after some seconds:
> Feb 1, 2017 11:32:32 PM VM ol65 has recovered from paused back to up
>
> Multipath status for my device:
>
> 3600a0b8000299aa8d08b55014119 dm-2 IBM ,1814  FAStT
> size=4.0T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1
> rdac' wp=rw
> |-+- policy='service-time 0' prio=0 status=active
> | |- 0:0:1:3 sdj 8:144 active undef running
> | `- 2:0:1:3 sdp 8:240 active undef running
> `-+- policy='service-time 0' prio=0 status=enabled
>   |- 0:0:0:3 sdd 8:48  active undef running
>   `- 2:0:0:3 sdk 8:160 active undef running
>
> In engine.log
>
> 2017-02-01 23:22:01,449 INFO
> [org.ovirt.engine.core.bll.aaa.CreateUserSessionCommand] (default task-15)
> [530aee87] Running command: CreateUserSessionCommand internal: false.
> 2017-02-01 23:22:04,011 INFO
> [org.ovirt.engine.docs.utils.servlet.ContextSensitiveHelpMappingServlet]
> (default task-12) [] Context-sensitive help is not installed. Manual
> directory doesn't exist: /usr/share/ovirt-engine/manual
> 2017-02-01 23:31:43,936 INFO
> [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
> (ForkJoinPool-1-worker-10) [10c15e39] VM
> '932db7c7-4121-4cbe-ad8d-09e4e99b3cdd'(ol65) moved from 'Up' --> 'Paused'
> 2017-02-01 23:31:44,087 INFO
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (ForkJoinPool-1-worker-10) [10c15e39] Correlation ID: null, Call Stack:
> null, Custom Event ID: -1, Message: VM ol65 has been paused.
> 2017-02-01 23:31:44,227 ERROR
> 

Re: [ovirt-users] VM has been paused due to storage I/O problem

2017-02-01 Thread Gianluca Cecchi
On Wed, Feb 1, 2017 at 8:22 PM, Gianluca Cecchi 
wrote:

>
> OK. In the mean time I have applied your suggested config and restarted
> the 2 nodes.
> Let me test and see if I find any problems running also some I/O tests.
> Thanks in the mean time,
> Gianluca
>


Quick test without much success

Inside the guest I run this loop
while true
do
time dd if=/dev/zero bs=1024k count=1024 of=/home/g.cecchi/testfile
sleep 5
done

BTW: my home is inside the / filesystem on the guest, which has space to
accommodate the 1GB of the dd command:
[g.cecchi@ol65 ~]$ df -h /home/g.cecchi/
FilesystemSize  Used Avail Use% Mounted on
/dev/mapper/vg_ol65-lv_root
   20G  4.9G   14G  27% /
[g.cecchi@ol65 ~]$

After about 7 rounds I get this in messages of the host where the VM is
running:

Feb  1 23:31:39 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb  1 23:31:39 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb  1 23:31:44 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb  1 23:31:45 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb  1 23:31:45 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb  1 23:31:47 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb  1 23:31:50 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb  1 23:31:50 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb  1 23:31:56 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb  1 23:31:57 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!

Nothing on the other host.
In web admin events pane:
Feb 1, 2017 11:31:44 PM VM ol65 has been paused due to no Storage space
error.
Feb 1, 2017 11:31:44 PM VM ol65 has been paused.

I stop the dd loop and after some seconds:
Feb 1, 2017 11:32:32 PM VM ol65 has recovered from paused back to up

Multipath status for my device:

3600a0b8000299aa8d08b55014119 dm-2 IBM ,1814  FAStT
size=4.0T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1
rdac' wp=rw
|-+- policy='service-time 0' prio=0 status=active
| |- 0:0:1:3 sdj 8:144 active undef running
| `- 2:0:1:3 sdp 8:240 active undef running
`-+- policy='service-time 0' prio=0 status=enabled
  |- 0:0:0:3 sdd 8:48  active undef running
  `- 2:0:0:3 sdk 8:160 active undef running

In engine.log

2017-02-01 23:22:01,449 INFO
[org.ovirt.engine.core.bll.aaa.CreateUserSessionCommand] (default task-15)
[530aee87] Running command: CreateUserSessionCommand internal: false.
2017-02-01 23:22:04,011 INFO
[org.ovirt.engine.docs.utils.servlet.ContextSensitiveHelpMappingServlet]
(default task-12) [] Context-sensitive help is not installed. Manual
directory doesn't exist: /usr/share/ovirt-engine/manual
2017-02-01 23:31:43,936 INFO
[org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
(ForkJoinPool-1-worker-10) [10c15e39] VM
'932db7c7-4121-4cbe-ad8d-09e4e99b3cdd'(ol65) moved from 'Up' --> 'Paused'
2017-02-01 23:31:44,087 INFO
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(ForkJoinPool-1-worker-10) [10c15e39] Correlation ID: null, Call Stack:
null, Custom Event ID: -1, Message: VM ol65 has been paused.
2017-02-01 23:31:44,227 ERROR
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(ForkJoinPool-1-worker-10) [10c15e39] Correlation ID: null, Call Stack:
null, Custom Event ID: -1, Message: VM ol65 has been paused due to no
Storage space error.
2017-02-01 23:32:32,100 INFO
[org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
(DefaultQuartzScheduler5) [1c02a371] VM
'932db7c7-4121-4cbe-ad8d-09e4e99b3cdd'(ol65) moved from 'Paused' --> 'Up'
2017-02-01 23:32:32,259 INFO
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(DefaultQuartzScheduler5) [1c02a371] Correlation ID: null, Call Stack:
null, Custom Event ID: -1, Message: VM ol65 has recovered from paused back
to up.


In vdsm.log of the host where the VM is running I see this:

mailbox.SPMMonitor::DEBUG::2017-02-01 23:31:37,729::storage_mailbox::733::Storage.Misc.excCmd::(_checkForMail) /usr/bin/taskset --cpu-list 0-7 dd if=/rhev/data-center/588237b8-0031-02f6-035d-0136/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None)
mailbox.SPMMonitor::DEBUG::2017-02-01 23:31:37,764::storage_mailbox::733::Storage.Misc.excCmd::(_checkForMail) SUCCESS:  = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.0061866 s, 166 MB/s\n';  = 0
periodic/8::INFO::2017-02-01 23:31:39,699::vm::1028::virt.vm::(extendDrivesIfNeeded) vmId=`932db7c7-4121-4cbe-ad8d-09e4e99b3cdd`::Requesting extension for volume 66615920-619f-4f52-ad5a-3f062c054094 on domain 922b5269-ab56-4c4d-838f-49d33427e2ab (apparent: 12884901888, capacity: 96636764160, allocated: 12355698688, physical: 12884901888)
periodic/8::DEBUG::2017-02-01 23:31:39,699::vm::1093::virt.vm::(__extendDriveVolume) vmId=`932db7c7-4121-4cbe-ad8d-09e4e99b3cdd`::Requesting an extension for the volume: {'newSize': 

Re: [ovirt-users] VM has been paused due to storage I/O problem

2017-02-01 Thread Michael Watters
I'm also seeing this error using a Dell MD3800i array.  The multipath
errors shown in our logs are different however.

Feb  1 15:11:58 ovirt-node-production2 kernel: dd: sending ioctl
80306d02 to a partition!
Feb  1 15:21:01 ovirt-node-production2 multipathd: dm-31: remove map
(uevent)
Feb  1 15:21:01 ovirt-node-production2 multipathd: dm-31: devmap not
registered, can't remove
Feb  1 15:21:01 ovirt-node-production2 multipathd: dm-31: remove map
(uevent)

The dd error seems to happen every time that SPM runs a test.


On 01/31/2017 09:23 AM, Nathanaël Blanchet wrote:
>
> exactly the same issue by there with FC EMC domain storage...
>
>
> Le 31/01/2017 à 15:20, Gianluca Cecchi a écrit :
>> Hello,
>> my test environment is composed by 2 old HP blades BL685c G1
>> (ovmsrv05 and ovmsrv06) and they are connected in a SAN with
>> FC-switches to an old IBM DS4700 storage array.
>> Apart from being old, they seem all ok from an hw point of view.
>> I have configured oVirt 4.0.6 and an FCP storage domain.
>> The hosts are plain CentOS 7.3 servers fully updated.
>> It is not an hosted engine environment: the manager is a vm outside
>> of the cluster.
>> I have configured power mgmt on both and it works good.
>>
>> I have at the moment  only one VM for test and it is doing quite nothing.
>>
>> Starting point: ovmsrv05 is in maintenance (since about 2 days) and
>> the VM is running on ovmsrv06.
>> I update qemu-kvm package on ovmsrv05 and then I restart it from web
>> admin gui:
>> Power Mgmt --> Restart
>>
>> Sequence of events in pane and the problem in subject:
>> Jan 31, 2017 10:29:43 AM Host ovmsrv05 power management was verified
>> successfully.
>> Jan 31, 2017 10:29:43 AM Status of host ovmsrv05 was set to Up.
>> Jan 31, 2017 10:29:38 AM Executing power management status on Host
>> ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
>> Jan 31, 2017 10:29:29 AM Activation of host ovmsrv05 initiated by
>> admin@internal-authz.
>> Jan 31, 2017 10:28:05 AM VM ol65 has recovered from paused back to up.
>> Jan 31, 2017 10:27:55 AM VM ol65 has been paused due to storage I/O
>> problem.
>> Jan 31, 2017 10:27:55 AM VM ol65 has been paused.
>> Jan 31, 2017 10:25:52 AM Host ovmsrv05 was restarted by
>> admin@internal-authz.
>> Jan 31, 2017 10:25:52 AM Host ovmsrv05 was started by
>> admin@internal-authz.
>> Jan 31, 2017 10:25:52 AM Power management start of Host ovmsrv05
>> succeeded.
>> Jan 31, 2017 10:25:50 AM Executing power management status on Host
>> ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
>> Jan 31, 2017 10:25:37 AM Executing power management start on Host
>> ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
>> Jan 31, 2017 10:25:37 AM Power management start of Host ovmsrv05
>> initiated.
>> Jan 31, 2017 10:25:37 AM Auto fence for host ovmsrv05 was started.
>> Jan 31, 2017 10:25:37 AM All VMs' status on Non Responsive Host
>> ovmsrv05 were changed to 'Down' by admin@internal-authz
>> Jan 31, 2017 10:25:36 AM Host ovmsrv05 was stopped by
>> admin@internal-authz.
>> Jan 31, 2017 10:25:36 AM Power management stop of Host ovmsrv05
>> succeeded.
>> Jan 31, 2017 10:25:34 AM Executing power management status on Host
>> ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
>> Jan 31, 2017 10:25:15 AM Executing power management stop on Host
>> ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
>> Jan 31, 2017 10:25:15 AM Power management stop of Host ovmsrv05
>> initiated.
>> Jan 31, 2017 10:25:12 AM Executing power management status on Host
>> ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
>>
>> Watching the timestamps, the culprit seems the reboot time of
>> ovmsrv05 that detects some LUNs in owned state and other ones in unowned
>> Full messages of both hosts here:
>> https://drive.google.com/file/d/0BwoPbcrMv8mvekZQT1pjc0NMRlU/view?usp=sharing
>> and
>> https://drive.google.com/file/d/0BwoPbcrMv8mvcjBCYVdFZWdXTms/view?usp=sharing
>>
>> At this time there are 4 LUNs globally seen by the two hosts but only
>> 1 of them is currently configured as the only storage domain in oVirt
>> cluster.
>>
>> [root@ovmsrv05 ~]# multipath -l | grep ^36
>> 3600a0b8000299aa8d08b55014119 dm-5 IBM ,1814  FAStT 
>> 3600a0b8000299902cd3c5501458f dm-3 IBM ,1814  FAStT 
>> 3600a0b8000299902ccf855011198 dm-2 IBM ,1814  FAStT 
>> 3600a0b8000299aa8d08955014098 dm-4 IBM ,1814  FAStT 
>>
>> the configured one:
>> [root@ovmsrv05 ~]# multipath -l 3600a0b8000299aa8d08b55014119
>> 3600a0b8000299aa8d08b55014119 dm-5 IBM ,1814  FAStT 
>> size=4.0T features='0' hwhandler='1 rdac' wp=rw
>> |-+- policy='service-time 0' prio=0 status=active
>> | |- 0:0:1:3 sdl 8:176 active undef running
>> | `- 2:0:1:3 sdp 8:240 active undef running
>> `-+- policy='service-time 0' prio=0 status=enabled
>>   |- 0:0:0:3 sdd 8:48  active undef running
>>   `- 2:0:0:3 sdi 8:128 active undef running

Re: [ovirt-users] VM has been paused due to storage I/O problem

2017-02-01 Thread Gianluca Cecchi
On Wed, Feb 1, 2017 at 8:39 AM, Nir Soffer  wrote:

>
>
> Hi Gianluca,
>
> This should be a number, not a string, maybe multipath is having trouble
> parsing this and it ignores your value?
>

I don't think so, also because, reading the DM Multipath guide at
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/DM_Multipath/multipath_config_confirm.html
it seems that in RHEL 7.3 the "show config" command has this behaviour:

"
For example, the following command sequence displays the multipath
configuration, including the defaults, before exiting the console.

# multipathd -k
> > show config
> > CTRL-D
"

So the output has to include the defaults too. Anyway, I changed it; see below.

< BEGIN OF PARENTHESIS
In theory it should be the same on RH EL 6.8
(see https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/DM_Multipath/multipath_config_confirm.html)
but it is not so for me on a system that is on 6.5, with
device-mapper-multipath-0.4.9-93.el6.x86_64 and connected to Netapp

In /usr/share/doc/device-mapper-multipath-0.4.9/multipath.conf.defaults


#   vendor "NETAPP"
#   product "LUN.*"
#   path_grouping_policy group_by_prio
#   getuid_callout "/lib/udev/scsi_id --whitelisted
--device=/dev/%n"
#   path_selector "round-robin 0"
#   path_checker tur
#   features "3 queue_if_no_path pg_init_retries 50"
#   hardware_handler "0"
#   prio ontap
#   failback immediate
#   rr_weight uniform
#   rr_min_io 128
#   rr_min_io_rq 1
#   flush_on_last_del yes
#   fast_io_fail_tmo 5
#   dev_loss_tmo infinity
#   retain_attached_hw_handler yes
#   detect_prio yes
#   reload_readwrite yes
#   }



My customization in multipath.conf, based on Netapp guidelines and my
Netapp storage array setup:

devices {
    device {
        vendor "NETAPP"
        product "LUN.*"
        getuid_callout "/lib/udev/scsi_id -g -u -d /dev/%n"
        hardware_handler "1 alua"
        prio alua
    }
}

If I run "multipathd show config" I see only 1 entry for NETAPP/LUN
vendor/product and it is a merge of default and my custom.

device {
vendor "NETAPP"
product "LUN.*"
path_grouping_policy group_by_prio
getuid_callout "/lib/udev/scsi_id -g -u -d /dev/%n"
path_selector "round-robin 0"
path_checker tur
features "3 queue_if_no_path pg_init_retries 50"
hardware_handler "1 alua"
prio alua
failback immediate
rr_weight uniform
rr_min_io 128
rr_min_io_rq 1
flush_on_last_del yes
fast_io_fail_tmo 5
dev_loss_tmo infinity
retain_attached_hw_handler yes
detect_prio yes
reload_readwrite yes
}

So this difference confused me when configuring multipath in CentOS 7.3. I
have to see when I'm going to update from 6.5 to 6.8 if this changes.

< END OF PARENTHESIS


> > }
> > }
> >
> > So I put exactly the default device config for my IBM/1814 device but
> > no_path_retry set to 12.
>
> Why 12?
>
> This will do 12 retries, 5 seconds each when no path is available. This
> will
> block lvm commands for 60 seconds when no path is available, blocking
> other stuff in vdsm. Vdsm is not designed to handle this.
>
> I recommend value of 4.
>

OK.



> But note that this is not related to the fact that your devices are not
> initialized properly after boot.
>

In fact it could also be an overall DS4700 problem. The LUNs are
configured as LNX CLUSTER type, which should be OK in theory, even if this
kind of storage was never that well supported with Linux.
Initially one had to use proprietary IBM kernel modules/drivers.
I will verify consistency and robustness through testing.
I have to do a POC and this is the hardware I have, so I should at least try
to have a working solution for it.


>
> > In CentOS 6.x when you do something like this, "show config" gives you
> the
> > modified entry only for your device section.
> > Instead in CentOS 7.3 it seems I get anyway the default one for IBM/1814
> and
> > also the customized one at the end of the output
>
> Maybe your device configuration does not match exactly the builtin config.
>

I think it is the different behaviour as outlined above. I think you can
confirm in another system where some customization has been done too...


>
>
> Maybe waiting a moment helps the storage/switches to clean up
> properly after a server is shut down?
>

I think so too. If errors repeat with the new config, I'll do a separate
stop/start instead of a restart whenever possible.


>
> 

Re: [ovirt-users] VM has been paused due to storage I/O problem

2017-01-31 Thread Nir Soffer
On Tue, Jan 31, 2017 at 6:09 PM, Gianluca Cecchi
 wrote:
> On Tue, Jan 31, 2017 at 3:23 PM, Nathanaël Blanchet 
> wrote:
>>
>> exactly the same issue here with an FC EMC storage domain...
>>
>>
>
> I'm trying to mitigate this by inserting a timeout for my SAN devices, but I'm
> not sure of its effectiveness, as the CentOS 7 behavior of "multipathd -k" and
> then "show config" seems different from CentOS 6.x
> In fact my attempt for multipath.conf is this
>
>
> # VDSM REVISION 1.3
> # VDSM PRIVATE
>
> defaults {
> polling_interval    5
> no_path_retry       fail
> user_friendly_names no
> flush_on_last_del   yes
> fast_io_fail_tmo    5
> dev_loss_tmo        30
> max_fds 4096
> }
>
> # Remove devices entries when overrides section is available.
> devices {
> device {
> # These settings overrides built-in devices settings. It does not apply
> # to devices without built-in settings (these use the settings in the
> # "defaults" section), or to devices defined in the "devices" section.
> # Note: This is not available yet on Fedora 21. For more info see
> # https://bugzilla.redhat.com/1253799
> all_devs        yes
> no_path_retry   fail
> }
> device {
> vendor "IBM"
> product "^1814"
> product_blacklist "Universal Xport"
> path_grouping_policy "group_by_prio"
> path_checker "rdac"
> features "0"
> hardware_handler "1 rdac"
> prio "rdac"
> failback immediate
> rr_weight "uniform"
> no_path_retry "12"

Hi Gianluca,

This should be a number, not a string; maybe multipath is having trouble
parsing it and ignoring your value?

> }
> }
>
> So I put exactly the default device config for my IBM/1814 device but
> no_path_retry set to 12.

Why 12?

This will do 12 retries, 5 seconds each when no path is available. This will
block lvm commands for 60 seconds when no path is available, blocking
other stuff in vdsm. Vdsm is not designed to handle this.
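For reference, a rough sketch of the arithmetic, assuming the polling_interval
of 5 from the defaults section quoted above:

queue time after all paths fail ~= no_path_retry * polling_interval
no_path_retry 12  ->  12 * 5s = 60s
no_path_retry  4  ->   4 * 5s = 20s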

I recommend a value of 4.

But note that this is not related to the fact that your devices are not
initialized properly after boot.
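As a sketch only, combining the two remarks above (write no_path_retry as a
plain number, and use the recommended 4), the device entry would become the
following, keeping the rest of the builtin IBM/1814 settings quoted earlier:

device {
    vendor "IBM"
    product "^1814"
    product_blacklist "Universal Xport"
    path_grouping_policy "group_by_prio"
    path_checker "rdac"
    features "0"
    hardware_handler "1 rdac"
    prio "rdac"
    failback immediate
    rr_weight "uniform"
    no_path_retry 4
}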

> In CentOS 6.x when you do something like this, "show config" gives you the
> modified entry only for your device section.
> Instead, in CentOS 7.3 it seems I get the default entry for IBM/1814 anyway,
> and also the customized one at the end of the output.

Maybe your device configuration does not match exactly the builtin config.

>
> Two facts:
> - before I could reproduce the problem if I selected
> Maintenance
> Power Mgmt ---> Restart
> (tried 3 times with same behavior)
>
> Instead if I executed in separate steps
> Maintenance
> Power Mgmt --> Stop
> wait a moment
> Power Mgmt --> Start
>
> I didn't get problems (tried only one time...)

Maybe waiting a moment helps the storage/switches to clean up
properly after a server is shut down?

Does your power management trigger a proper shutdown?
I would avoid using it for normal shutdown.

>
> With this "new" multipath config (to be confirmed if in effect, how?) I
> don't get the VM paused problem even with Restart option of Power Mgmt
> In active host messages I see these ones when the other reboots:
>
> Jan 31 16:50:01 ovmsrv06 systemd: Started Session 705 of user root.
> Jan 31 16:50:01 ovmsrv06 systemd: Starting Session 705 of user root.
> Jan 31 16:53:47 ovmsrv06 multipathd: 3600a0b8000299aa8d08955014098: sde
> - rdac checker reports path is up
> Jan 31 16:53:47 ovmsrv06 multipathd: 8:64: reinstated
> Jan 31 16:53:47 ovmsrv06 multipathd: 3600a0b8000299aa8d08955014098: load
> table [0 41943040 multipath 1 queue_if_no_path 1 rdac 2 1 service-time 0 2 1
> 8:224 1 65:0 1 service-time 0 2 1 8:64 1 8:160 1]
> Jan 31 16:53:47 ovmsrv06 multipathd: 3600a0b8000299aa8d08955014098: sdo
> - rdac checker reports path is ghost
> Jan 31 16:53:47 ovmsrv06 multipathd: 8:224: reinstated
> Jan 31 16:53:47 ovmsrv06 multipathd: 3600a0b8000299aa8d08955014098: sdk
> - rdac checker reports path is up
> Jan 31 16:53:47 ovmsrv06 multipathd: 8:160: reinstated
> Jan 31 16:53:47 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_DS4700, ctlr 1,
> queueing MODE_SELECT command
> Jan 31 16:53:47 ovmsrv06 multipathd: 3600a0b8000299aa8d08955014098: sdq
> - rdac checker reports path is ghost
> Jan 31 16:53:47 ovmsrv06 multipathd: 65:0: reinstated
> Jan 31 16:53:48 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_DS4700, ctlr 1,
> MODE_SELECT returned with sense 05/91/36
> Jan 31 16:53:48 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_DS4700, ctlr 1,
> queueing MODE_SELECT command
> Jan 31 16:53:49 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_DS4700, ctlr 1,
> MODE_SELECT returned with sense 05/91/36
> Jan 31 

Re: [ovirt-users] VM has been paused due to storage I/O problem

2017-01-31 Thread Gianluca Cecchi
On Tue, Jan 31, 2017 at 3:23 PM, Nathanaël Blanchet 
wrote:

> exactly the same issue here with an FC EMC storage domain...
>
>
I'm trying to mitigate this by inserting a timeout for my SAN devices, but I'm
not sure of its effectiveness, as the CentOS 7 behavior of "multipathd -k" and
then "show config" seems different from CentOS 6.x
In fact my attempt for multipath.conf is this


# VDSM REVISION 1.3
# VDSM PRIVATE

defaults {
polling_interval    5
no_path_retry       fail
user_friendly_names no
flush_on_last_del   yes
fast_io_fail_tmo    5
dev_loss_tmo        30
max_fds 4096
}

# Remove devices entries when overrides section is available.
devices {
device {
# These settings overrides built-in devices settings. It does not apply
# to devices without built-in settings (these use the settings in the
# "defaults" section), or to devices defined in the "devices" section.
# Note: This is not available yet on Fedora 21. For more info see
# https://bugzilla.redhat.com/1253799
all_devs        yes
no_path_retry   fail
}
device {
vendor "IBM"
product "^1814"
product_blacklist "Universal Xport"
path_grouping_policy "group_by_prio"
path_checker "rdac"
features "0"
hardware_handler "1 rdac"
prio "rdac"
failback immediate
rr_weight "uniform"
no_path_retry "12"
}
}

So I put exactly the default device config for my IBM/1814 device but
no_path_retry set to 12.

In CentOS 6.x when you do something like this, "show config" gives you the
modified entry only for your device section.
Instead, in CentOS 7.3 it seems I get the default entry for IBM/1814 anyway,
and also the customized one at the end of the output.

Two facts:
- before, I could reproduce the problem if I selected
Maintenance
Power Mgmt ---> Restart
(tried 3 times with same behavior)

Instead if I executed in separate steps
Maintenance
Power Mgmt --> Stop
wait a moment
Power Mgmt --> Start

I didn't get problems (tried only one time...)

With this "new" multipath config (to be confirmed if in effect, how?) I
don't get the VM paused problem even with Restart option of Power Mgmt
In the active host's messages I see these lines when the other one reboots:

Jan 31 16:50:01 ovmsrv06 systemd: Started Session 705 of user root.
Jan 31 16:50:01 ovmsrv06 systemd: Starting Session 705 of user root.
Jan 31 16:53:47 ovmsrv06 multipathd: 3600a0b8000299aa8d08955014098: sde
- rdac checker reports path is up
Jan 31 16:53:47 ovmsrv06 multipathd: 8:64: reinstated
Jan 31 16:53:47 ovmsrv06 multipathd: 3600a0b8000299aa8d08955014098:
load table [0 41943040 multipath 1 queue_if_no_path 1 rdac 2 1 service-time
0 2 1 8:224 1 65:0 1 service-time 0 2 1 8:64 1 8:160 1]
Jan 31 16:53:47 ovmsrv06 multipathd: 3600a0b8000299aa8d08955014098: sdo
- rdac checker reports path is ghost
Jan 31 16:53:47 ovmsrv06 multipathd: 8:224: reinstated
Jan 31 16:53:47 ovmsrv06 multipathd: 3600a0b8000299aa8d08955014098: sdk
- rdac checker reports path is up
Jan 31 16:53:47 ovmsrv06 multipathd: 8:160: reinstated
Jan 31 16:53:47 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_DS4700, ctlr 1,
queueing MODE_SELECT command
Jan 31 16:53:47 ovmsrv06 multipathd: 3600a0b8000299aa8d08955014098: sdq
- rdac checker reports path is ghost
Jan 31 16:53:47 ovmsrv06 multipathd: 65:0: reinstated
Jan 31 16:53:48 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_DS4700, ctlr 1,
MODE_SELECT returned with sense 05/91/36
Jan 31 16:53:48 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_DS4700, ctlr 1,
queueing MODE_SELECT command
Jan 31 16:53:49 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_DS4700, ctlr 1,
MODE_SELECT returned with sense 05/91/36
Jan 31 16:53:49 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_DS4700, ctlr 1,
queueing MODE_SELECT command
Jan 31 16:53:49 ovmsrv06 kernel: sd 0:0:1:4: rdac: array Z1_DS4700, ctlr 1,
MODE_SELECT completed
Jan 31 16:53:49 ovmsrv06 kernel: sd 2:0:1:4: rdac: array Z1_DS4700, ctlr 1,
queueing MODE_SELECT command
Jan 31 16:53:49 ovmsrv06 kernel: sd 2:0:1:4: rdac: array Z1_DS4700, ctlr 1,
MODE_SELECT completed
Jan 31 16:53:52 ovmsrv06 multipathd: 3600a0b8000299aa8d08955014098: sde
- rdac checker reports path is ghost
Jan 31 16:53:52 ovmsrv06 multipathd: 8:64: reinstated
Jan 31 16:53:52 ovmsrv06 multipathd: 3600a0b8000299aa8d08955014098:
load table [0 41943040 multipath 1 queue_if_no_path 1 rdac 2 1 service-time
0 2 1 8:224 1 65:0 1 service-time 0 2 1 8:64 1 8:160 1]
Jan 31 16:53:52 ovmsrv06 multipathd: 3600a0b8000299aa8d08955014098: sdo
- rdac checker reports path is up
Jan 31 16:53:52 ovmsrv06 multipathd: 8:224: reinstated
Jan 31 16:53:52 ovmsrv06 multipathd: 3600a0b8000299aa8d08955014098: sdk
- rdac checker reports path is ghost

Re: [ovirt-users] VM has been paused due to storage I/O problem

2017-01-31 Thread Nathanaël Blanchet

exactly the same issue here with an FC EMC storage domain...


On 31/01/2017 at 15:20, Gianluca Cecchi wrote:

Hello,
my test environment is composed of 2 old HP blades BL685c G1 (ovmsrv05 
and ovmsrv06) and they are connected in a SAN with FC-switches to an 
old IBM DS4700 storage array.

Apart from being old, they seem all ok from an hw point of view.
I have configured oVirt 4.0.6 and an FCP storage domain.
The hosts are plain CentOS 7.3 servers fully updated.
It is not a hosted engine environment: the manager is a vm outside of 
the cluster.

I have configured power mgmt on both and it works good.

At the moment I have only one VM for testing and it is doing almost nothing.

Starting point: ovmsrv05 has been in maintenance (for about 2 days) and 
the VM is running on ovmsrv06.
I update qemu-kvm package on ovmsrv05 and then I restart it from web 
admin gui:

Power Mgmt --> Restart

Sequence of events in pane and the problem in subject:
Jan 31, 2017 10:29:43 AM Host ovmsrv05 power management was verified 
successfully.

Jan 31, 2017 10:29:43 AM Status of host ovmsrv05 was set to Up.
Jan 31, 2017 10:29:38 AM Executing power management status on Host 
ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Jan 31, 2017 10:29:29 AM Activation of host ovmsrv05 initiated by 
admin@internal-authz.

Jan 31, 2017 10:28:05 AM VM ol65 has recovered from paused back to up.
Jan 31, 2017 10:27:55 AM VM ol65 has been paused due to storage I/O 
problem.

Jan 31, 2017 10:27:55 AM VM ol65 has been paused.
Jan 31, 2017 10:25:52 AM Host ovmsrv05 was restarted by 
admin@internal-authz.
Jan 31, 2017 10:25:52 AM Host ovmsrv05 was started by 
admin@internal-authz.
Jan 31, 2017 10:25:52 AM Power management start of Host ovmsrv05 
succeeded.
Jan 31, 2017 10:25:50 AM Executing power management status on Host 
ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Jan 31, 2017 10:25:37 AM Executing power management start on Host 
ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Jan 31, 2017 10:25:37 AM Power management start of Host ovmsrv05 
initiated.

Jan 31, 2017 10:25:37 AM Auto fence for host ovmsrv05 was started.
Jan 31, 2017 10:25:37 AM All VMs' status on Non Responsive Host 
ovmsrv05 were changed to 'Down' by admin@internal-authz
Jan 31, 2017 10:25:36 AM Host ovmsrv05 was stopped by 
admin@internal-authz.

Jan 31, 2017 10:25:36 AM Power management stop of Host ovmsrv05 succeeded.
Jan 31, 2017 10:25:34 AM Executing power management status on Host 
ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Jan 31, 2017 10:25:15 AM Executing power management stop on Host 
ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.

Jan 31, 2017 10:25:15 AM Power management stop of Host ovmsrv05 initiated.
Jan 31, 2017 10:25:12 AM Executing power management status on Host 
ovmsrv05 using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.


Watching the timestamps, the culprit seems to be the reboot of ovmsrv05, 
which detects some LUNs in owned state and others in unowned state.

Full messages of both hosts here:
https://drive.google.com/file/d/0BwoPbcrMv8mvekZQT1pjc0NMRlU/view?usp=sharing
and
https://drive.google.com/file/d/0BwoPbcrMv8mvcjBCYVdFZWdXTms/view?usp=sharing

At this time there are 4 LUNs globally seen by the two hosts but only 
1 of them is currently configured as the only storage domain in oVirt 
cluster.


[root@ovmsrv05 ~]# multipath -l | grep ^36
3600a0b8000299aa8d08b55014119 dm-5 IBM ,1814  FAStT
3600a0b8000299902cd3c5501458f dm-3 IBM ,1814  FAStT
3600a0b8000299902ccf855011198 dm-2 IBM ,1814  FAStT
3600a0b8000299aa8d08955014098 dm-4 IBM ,1814  FAStT

the configured one:
[root@ovmsrv05 ~]# multipath -l 3600a0b8000299aa8d08b55014119
3600a0b8000299aa8d08b55014119 dm-5 IBM ,1814  FAStT
size=4.0T features='0' hwhandler='1 rdac' wp=rw
|-+- policy='service-time 0' prio=0 status=active
| |- 0:0:1:3 sdl 8:176 active undef running
| `- 2:0:1:3 sdp 8:240 active undef running
`-+- policy='service-time 0' prio=0 status=enabled
  |- 0:0:0:3 sdd 8:48  active undef running
  `- 2:0:0:3 sdi 8:128 active undef running

In the messages of the booting node, around the problem registered by the storage:
[root@ovmsrv05 ~]# grep owned /var/log/messages
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:1: rdac: LUN 1 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:2: rdac: LUN 2 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:3: rdac: LUN 3 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:1: rdac: LUN 1 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:4: rdac: LUN 4 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:2: rdac: LUN 2 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:1:1: rdac: LUN 1 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:3: rdac: LUN 3 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:4: rdac: LUN 4 (RDAC) (unowned)
Jan 31 

[ovirt-users] VM has been paused due to storage I/O problem

2017-01-31 Thread Gianluca Cecchi
Hello,
my test environment is composed of 2 old HP blades BL685c G1 (ovmsrv05 and
ovmsrv06) and they are connected in a SAN with FC-switches to an old IBM
DS4700 storage array.
Apart from being old, they seem all ok from an hw point of view.
I have configured oVirt 4.0.6 and an FCP storage domain.
The hosts are plain CentOS 7.3 servers fully updated.
It is not a hosted engine environment: the manager is a vm outside of the
cluster.
I have configured power mgmt on both and it works good.

At the moment I have only one VM for testing and it is doing almost nothing.

Starting point: ovmsrv05 has been in maintenance (for about 2 days) and the VM
is running on ovmsrv06.
I update qemu-kvm package on ovmsrv05 and then I restart it from web admin
gui:
Power Mgmt --> Restart

Sequence of events in pane and the problem in subject:
Jan 31, 2017 10:29:43 AM Host ovmsrv05 power management was verified
successfully.
Jan 31, 2017 10:29:43 AM Status of host ovmsrv05 was set to Up.
Jan 31, 2017 10:29:38 AM Executing power management status on Host ovmsrv05
using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Jan 31, 2017 10:29:29 AM Activation of host ovmsrv05 initiated by
admin@internal-authz.
Jan 31, 2017 10:28:05 AM VM ol65 has recovered from paused back to up.
Jan 31, 2017 10:27:55 AM VM ol65 has been paused due to storage I/O problem.
Jan 31, 2017 10:27:55 AM VM ol65 has been paused.
Jan 31, 2017 10:25:52 AM Host ovmsrv05 was restarted by
admin@internal-authz.
Jan 31, 2017 10:25:52 AM Host ovmsrv05 was started by admin@internal-authz.
Jan 31, 2017 10:25:52 AM Power management start of Host ovmsrv05 succeeded.
Jan 31, 2017 10:25:50 AM Executing power management status on Host ovmsrv05
using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Jan 31, 2017 10:25:37 AM Executing power management start on Host ovmsrv05
using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Jan 31, 2017 10:25:37 AM Power management start of Host ovmsrv05 initiated.
Jan 31, 2017 10:25:37 AM Auto fence for host ovmsrv05 was started.
Jan 31, 2017 10:25:37 AM All VMs' status on Non Responsive Host ovmsrv05
were changed to 'Down' by admin@internal-authz
Jan 31, 2017 10:25:36 AM Host ovmsrv05 was stopped by admin@internal-authz.
Jan 31, 2017 10:25:36 AM Power management stop of Host ovmsrv05 succeeded.
Jan 31, 2017 10:25:34 AM Executing power management status on Host ovmsrv05
using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Jan 31, 2017 10:25:15 AM Executing power management stop on Host ovmsrv05
using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.
Jan 31, 2017 10:25:15 AM Power management stop of Host ovmsrv05 initiated.
Jan 31, 2017 10:25:12 AM Executing power management status on Host ovmsrv05
using Proxy Host ovmsrv06 and Fence Agent ilo:10.4.192.212.

Watching the timestamps, the culprit seems to be the reboot of ovmsrv05, which
detects some LUNs in owned state and others in unowned state.
Full messages of both hosts here:
https://drive.google.com/file/d/0BwoPbcrMv8mvekZQT1pjc0NMRlU/view?usp=sharing
and
https://drive.google.com/file/d/0BwoPbcrMv8mvcjBCYVdFZWdXTms/view?usp=sharing

At this time there are 4 LUNs globally seen by the two hosts but only 1 of
them is currently configured as the only storage domain in oVirt cluster.

[root@ovmsrv05 ~]# multipath -l | grep ^36
3600a0b8000299aa8d08b55014119 dm-5 IBM ,1814  FAStT
3600a0b8000299902cd3c5501458f dm-3 IBM ,1814  FAStT
3600a0b8000299902ccf855011198 dm-2 IBM ,1814  FAStT
3600a0b8000299aa8d08955014098 dm-4 IBM ,1814  FAStT

the configured one:
[root@ovmsrv05 ~]# multipath -l 3600a0b8000299aa8d08b55014119
3600a0b8000299aa8d08b55014119 dm-5 IBM ,1814  FAStT
size=4.0T features='0' hwhandler='1 rdac' wp=rw
|-+- policy='service-time 0' prio=0 status=active
| |- 0:0:1:3 sdl 8:176 active undef running
| `- 2:0:1:3 sdp 8:240 active undef running
`-+- policy='service-time 0' prio=0 status=enabled
  |- 0:0:0:3 sdd 8:48  active undef running
  `- 2:0:0:3 sdi 8:128 active undef running

In the messages of the booting node, around the problem registered by the storage:
[root@ovmsrv05 ~]# grep owned /var/log/messages
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:1: rdac: LUN 1 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:2: rdac: LUN 2 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:3: rdac: LUN 3 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:1: rdac: LUN 1 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:0:4: rdac: LUN 4 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:2: rdac: LUN 2 (RDAC) (owned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:1:1: rdac: LUN 1 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:3: rdac: LUN 3 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:0:4: rdac: LUN 4 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 0:0:1:2: rdac: LUN 2 (RDAC) (unowned)
Jan 31 10:27:38 ovmsrv05 kernel: scsi 2:0:1:1: rdac: LUN 1 (RDAC) (unowned)