Re: [ovirt-users] qcow2 images corruption

2018-02-18 Thread Nir Soffer
On Wed, Feb 7, 2018 at 7:09 PM Nicolas Ecarnot  wrote:

> Hello,
>
> TL;DR: qcow2 images keep getting corrupted. Any workaround?
>
> Long version:
> I have already started this discussion on the oVirt and qemu-block
> mailing lists, under similar circumstances, but I have learned more in
> the months since, so here is some information:
>
> - We are using 2 oVirt 3.6.7.5-1.el7.centos datacenters, using CentOS
> 7.{2,3} hosts
> - Hosts :
>- CentOS 7.2 1511 :
>  - Kernel = 3.10.0 327
>  - KVM : 2.3.0-31
>  - libvirt : 1.2.17
>  - vdsm : 4.17.32-1
>- CentOS 7.3 1611 :
>  - Kernel 3.10.0 514
>  - KVM : 2.3.0-31
>  - libvirt 2.0.0-10
>  - vdsm : 4.17.32-1
> - Our storage is 2 Equallogic SANs connected via iSCSI on a dedicated
> network
>

With 3.6 and iSCSI storage you are exposed to an issue with the lvmetad
service: it activates oVirt volumes by default, and it also activates
guest LVs inside oVirt raw volumes.
This can lead to data corruption when an LV was activated on one host and
then extended on another host, so the LV size seen by the first host no
longer reflects the actual LV size.
We had many bugs related to this; see this tracker for the related bugs:
https://bugzilla.redhat.com/1374545
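
If you suspect a host is holding a stale LV size, you can compare what the
LVM metadata reports with what the kernel block device actually exposes,
and refresh the LV. A minimal sketch (the vg/lv names are placeholders):

    # Size according to LVM metadata, in bytes:
    lvs --noheadings --units b -o size my_vg/my_lv

    # Size the kernel block device actually exposes, in bytes:
    blockdev --getsize64 /dev/my_vg/my_lv

    # If they differ, reload the device-mapper table from current metadata:
    lvchange --refresh my_vg/my_lv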

To avoid this issue, you need to:

1. Edit /etc/lvm/lvm.conf and set global/use_lvmetad to:

use_lvmetad = 0

2. Disable and mask these services:

- lvm2-lvmetad.socket
- lvm2-lvmetad.service
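
For example, the two steps could be done on each host like this (a sketch;
it assumes the stock "use_lvmetad = 1" line is present in lvm.conf,
otherwise edit the file by hand):

    # 1. Turn off lvmetad usage in the global section of lvm.conf:
    sed -i 's/use_lvmetad = 1/use_lvmetad = 0/' /etc/lvm/lvm.conf

    # 2. Stop, disable and mask the lvmetad socket and service:
    systemctl stop lvm2-lvmetad.socket lvm2-lvmetad.service
    systemctl disable lvm2-lvmetad.socket lvm2-lvmetad.service
    systemctl mask lvm2-lvmetad.socket lvm2-lvmetad.service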

Note that this may cause warnings from systemd during boot; the warnings
are harmless:
https://bugzilla.redhat.com/1462792

For extra safety and better performance, you should also set up an LVM
filter on all hosts.

See this post for an example of how it is done in 4.x:
https://www.ovirt.org/blog/2017/12/lvm-configuration-the-easy-way/

Since you run 3.6, you will have to set up the filter manually in the
same way.
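
As a sketch, the filter in the devices section of /etc/lvm/lvm.conf would
look something like this (the accepted device here is only an example;
use the devices that hold the host's own local filesystems):

    filter = [ "a|^/dev/sda2$|", "r|.*|" ]

This accepts only the host's local device and rejects everything else, so
LVM never scans or auto-activates LVs sitting on the shared iSCSI LUNs.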

Nir


> - It varies from week to week, but all in all there are around 32 hosts,
> 8 storage domains and, for various reasons, very few VMs (fewer than 200).
> - One peculiar point is that most of our VMs are given an additional
> dedicated network interface that is iSCSI-connected to some volumes of
> our SAN, these volumes not being part of the oVirt setup. That could
> lead to a lot of additional iSCSI traffic.
>
> From time to time, a random VM appears paused by oVirt.
> Digging into the oVirt engine logs, then into the host vdsm logs, it
> appears that the host considers the qcow2 image corrupted.
> In what I consider conservative behavior, vdsm stops any interaction
> with this image and marks the VM as paused.
> Any attempt to unpause it leads to the same conservative pause.
>
> After finding (https://access.redhat.com/solutions/1173623) the right
> logical volume hosting the qcow2 image, I can run qemu-img check on it.
> - On 80% of my VMs, I find no errors.
> - On 15% of them, I find leaked-cluster errors that I can correct using
> "qemu-img check -r all".
> - On 5% of them, I find leaked-cluster errors plus further fatal errors
> that cannot be corrected with qemu-img.
> In rare cases qemu-img can correct them but destroys large parts of the
> image (it becomes unusable), and in other cases it cannot correct them
> at all.
>
> Months ago, I already sent a similar message, but then the error was
> about "No space left on device"
> (https://www.mail-archive.com/qemu-block@gnu.org/msg00110.html).
>
> This time, I don't get the message about space, only corruption.
>
> I kept reading and found a similar discussion in the Proxmox forum:
> https://lists.ovirt.org/pipermail/users/2018-February/086750.html
>
>
> https://forum.proxmox.com/threads/qcow2-corruption-after-snapshot-or-heavy-disk-i-o.32865/page-2
>
> What I read that is similar to my case:
> - usage of qcow2
> - heavy disk I/O
> - using the virtio-blk driver
>
> In the Proxmox thread, they tend to say that using virtio-scsi is the
> solution. I have asked oVirt experts about this
> (https://lists.ovirt.org/pipermail/users/2018-February/086753.html), but
> it's not clear the driver is to blame.
>
> I agree with the answer Yaniv Kaul gave me, that I have to properly
> report the issue, so I'm eager to know which particular information I
> can give you now.
>
> As you can imagine, all this setup is in production, and for most of the
> VMs I cannot "play" with them. Moreover, we have launched a campaign of
> nightly stopping every VM, running qemu-img check on them one by one,
> then booting them again.
> So it might take some time before I find another corrupted image
> (which I'll carefully store for debugging).
>
> Other information: we very rarely take snapshots, but I can well
> imagine that automated migrations of VMs could trigger similar behavior
> on qcow2 images.
>
> Last point about the versions we use: yes, they're old; yes, we're
> planning to upgrade, but we don't know when.
>
> Regards,
>
> --
> Nicolas ECARNOT
> ___
> Users mailing list
> Users@ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users
>

Re: [ovirt-users] qcow2 images corruption

2018-02-08 Thread Nicolas Ecarnot

On 08/02/2018 at 13:59, Yaniv Kaul wrote:



On Feb 7, 2018 7:08 PM, "Nicolas Ecarnot" wrote:


Hello,

TL;DR: qcow2 images keep getting corrupted. Any workaround?

Long version:
This discussion has already been started by me on the oVirt and
qemu-block mailing lists, under similar circumstances, but I have
learned more in the months since, so here is some information:

- We are using 2 oVirt 3.6.7.5-1.el7.centos datacenters, using
CentOS 7.{2,3} hosts
- Hosts :
  - CentOS 7.2 1511 :
    - Kernel = 3.10.0 327
    - KVM : 2.3.0-31
    - libvirt : 1.2.17
    - vdsm : 4.17.32-1
  - CentOS 7.3 1611 :
    - Kernel 3.10.0 514
    - KVM : 2.3.0-31
    - libvirt 2.0.0-10
    - vdsm : 4.17.32-1


All are somewhat old releases. I suggest upgrading to the latest RHEL 
and qemu-kvm bits.


Later on, upgrade oVirt.
Y.

Hello Yaniv,

We could discuss for hours the fact that CentOS 7.3 was released in
January 2017 and is thus not that old.
We could also discuss for hours the gap between developers' will to push
their freshest releases and the brake that we, industry users, put on
adopting such new versions. In my case, the virtualization infrastructure
is just one of the 30+ domains I have to master every day, and the more
stable it is, the better.
In the setup described previously, the qemu qcow2 images were correct,
and then they were not. We did not change anything. We have to find a
workaround, and we need your expertise.


If we do not understand the cause of the corruption, we risk running into
the same situation in oVirt 4.2.


--
Nicolas Ecarnot
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] qcow2 images corruption

2018-02-08 Thread Yaniv Kaul
On Feb 7, 2018 7:08 PM, "Nicolas Ecarnot"  wrote:

Hello,

TL;DR: qcow2 images keep getting corrupted. Any workaround?

Long version:
This discussion has already been started by me on the oVirt and qemu-block
mailing lists, under similar circumstances, but I have learned more in the
months since, so here is some information:

- We are using 2 oVirt 3.6.7.5-1.el7.centos datacenters, using CentOS
7.{2,3} hosts
- Hosts :
  - CentOS 7.2 1511 :
- Kernel = 3.10.0 327
- KVM : 2.3.0-31
- libvirt : 1.2.17
- vdsm : 4.17.32-1
  - CentOS 7.3 1611 :
- Kernel 3.10.0 514
- KVM : 2.3.0-31
- libvirt 2.0.0-10
- vdsm : 4.17.32-1


All are somewhat old releases. I suggest upgrading to the latest RHEL and
qemu-kvm bits.

Later on, upgrade oVirt.
Y.

- Our storage is 2 Equallogic SANs connected via iSCSI on a dedicated
network
- It varies from week to week, but all in all there are around 32 hosts,
8 storage domains and, for various reasons, very few VMs (fewer than 200).
- One peculiar point is that most of our VMs are given an additional
dedicated network interface that is iSCSI-connected to some volumes of our
SAN, these volumes not being part of the oVirt setup. That could lead to a
lot of additional iSCSI traffic.

From time to time, a random VM appears paused by oVirt.
Digging into the oVirt engine logs, then into the host vdsm logs, it
appears that the host considers the qcow2 image corrupted.
In what I consider conservative behavior, vdsm stops any interaction with
this image and marks the VM as paused.
Any attempt to unpause it leads to the same conservative pause.

After finding (https://access.redhat.com/solutions/1173623) the right
logical volume hosting the qcow2 image, I can run qemu-img check on it.
- On 80% of my VMs, I find no errors.
- On 15% of them, I find leaked-cluster errors that I can correct using
"qemu-img check -r all".
- On 5% of them, I find leaked-cluster errors plus further fatal errors
that cannot be corrected with qemu-img.
In rare cases qemu-img can correct them but destroys large parts of the
image (it becomes unusable), and in other cases it cannot correct them at
all.

Months ago, I already sent a similar message, but then the error was about
"No space left on device"
(https://www.mail-archive.com/qemu-block@gnu.org/msg00110.html).

This time, I don't get the message about space, only corruption.

I kept reading and found a similar discussion in the Proxmox forum:
https://lists.ovirt.org/pipermail/users/2018-February/086750.html

https://forum.proxmox.com/threads/qcow2-corruption-after-snapshot-or-heavy-disk-i-o.32865/page-2

What I read that is similar to my case:
- usage of qcow2
- heavy disk I/O
- using the virtio-blk driver

In the Proxmox thread, they tend to say that using virtio-scsi is the
solution. I have asked oVirt experts about this (
https://lists.ovirt.org/pipermail/users/2018-February/086753.html), but
it's not clear the driver is to blame.

I agree with the answer Yaniv Kaul gave me, that I have to properly report
the issue, so I'm eager to know which particular information I can give
you now.

As you can imagine, all this setup is in production, and for most of the
VMs I cannot "play" with them. Moreover, we have launched a campaign of
nightly stopping every VM, running qemu-img check on them one by one, then
booting them again.
So it might take some time before I find another corrupted image
(which I'll carefully store for debugging).

Other information: we very rarely take snapshots, but I can well imagine
that automated migrations of VMs could trigger similar behavior on qcow2
images.

Last point about the versions we use: yes, they're old; yes, we're
planning to upgrade, but we don't know when.

Regards,

-- 
Nicolas ECARNOT
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


[ovirt-users] qcow2 images corruption

2018-02-07 Thread Nicolas Ecarnot

Hello,

TL;DR: qcow2 images keep getting corrupted. Any workaround?

Long version:
This discussion has already been started by me on the oVirt and
qemu-block mailing lists, under similar circumstances, but I have
learned more in the months since, so here is some information:


- We are using 2 oVirt 3.6.7.5-1.el7.centos datacenters, using CentOS 
7.{2,3} hosts

- Hosts :
  - CentOS 7.2 1511 :
- Kernel = 3.10.0 327
- KVM : 2.3.0-31
- libvirt : 1.2.17
- vdsm : 4.17.32-1
  - CentOS 7.3 1611 :
- Kernel 3.10.0 514
- KVM : 2.3.0-31
- libvirt 2.0.0-10
- vdsm : 4.17.32-1
- Our storage is 2 Equallogic SANs connected via iSCSI on a dedicated 
network
- It varies from week to week, but all in all there are around 32 hosts,
8 storage domains and, for various reasons, very few VMs (fewer than 200).
- One peculiar point is that most of our VMs are given an additional
dedicated network interface that is iSCSI-connected to some volumes of
our SAN, these volumes not being part of the oVirt setup. That could
lead to a lot of additional iSCSI traffic.


From time to time, a random VM appears paused by oVirt.
Digging into the oVirt engine logs, then into the host vdsm logs, it
appears that the host considers the qcow2 image corrupted.
In what I consider conservative behavior, vdsm stops any interaction
with this image and marks the VM as paused.

Any attempt to unpause it leads to the same conservative pause.

After finding (https://access.redhat.com/solutions/1173623) the right
logical volume hosting the qcow2 image, I can run qemu-img check on it.

- On 80% of my VMs, I find no errors.
- On 15% of them, I find leaked-cluster errors that I can correct using
"qemu-img check -r all".
- On 5% of them, I find leaked-cluster errors plus further fatal errors
that cannot be corrected with qemu-img.
In rare cases qemu-img can correct them but destroys large parts of the
image (it becomes unusable), and in other cases it cannot correct them
at all.
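
For the record, the sequence I use looks roughly like this (a sketch; the
vg/lv names are placeholders, and the LV must not be in use by a running
VM):

    # Activate the logical volume that backs the disk image:
    lvchange -ay my_vg/my_lv

    # Read-only consistency check of the qcow2 image:
    qemu-img check /dev/my_vg/my_lv

    # Repair attempt, either leaks only or all errors:
    qemu-img check -r leaks /dev/my_vg/my_lv
    qemu-img check -r all /dev/my_vg/my_lv

    # Deactivate the volume when done:
    lvchange -an my_vg/my_lv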


Months ago, I already sent a similar message, but then the error was
about "No space left on device"
(https://www.mail-archive.com/qemu-block@gnu.org/msg00110.html).


This time, I don't get the message about space, only corruption.

I kept reading and found a similar discussion in the Proxmox forum:
https://lists.ovirt.org/pipermail/users/2018-February/086750.html

https://forum.proxmox.com/threads/qcow2-corruption-after-snapshot-or-heavy-disk-i-o.32865/page-2

What I read that is similar to my case:
- usage of qcow2
- heavy disk I/O
- using the virtio-blk driver

In the Proxmox thread, they tend to say that using virtio-scsi is the
solution. I have asked oVirt experts about this
(https://lists.ovirt.org/pipermail/users/2018-February/086753.html), but
it's not clear the driver is to blame.


I agree with the answer Yaniv Kaul gave me, that I have to properly
report the issue, so I'm eager to know which particular information I
can give you now.


As you can imagine, all this setup is in production, and for most of the
VMs I cannot "play" with them. Moreover, we have launched a campaign of
nightly stopping every VM, running qemu-img check on them one by one,
then booting them again.

So it might take some time before I find another corrupted image
(which I'll carefully store for debugging).
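
In spirit, the nightly pass is a loop like the sketch below; shutting the
VMs down and booting them again is done through the engine and is not
shown, and the device paths are placeholders:

    # Check each (already shut down) VM disk and log the result:
    for lv in /dev/my_vg/*; do
        echo "== $lv" >> /var/log/qcow2-check.log
        qemu-img check "$lv" >> /var/log/qcow2-check.log 2>&1
    done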

Other information: we very rarely take snapshots, but I can well
imagine that automated migrations of VMs could trigger similar behavior
on qcow2 images.


Last point about the versions we use: yes, they're old; yes, we're
planning to upgrade, but we don't know when.


Regards,

--
Nicolas ECARNOT
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users