Re: [ovirt-users] One RHEV Virtual Machine does not Automatically Resume following Compellent SAN Controller Failover
Can you reply to my question?

Yaniv Dary
Technical Product Manager
Red Hat Israel Ltd.
34 Jerusalem Road
Building A, 4th floor
Ra'anana, Israel 4350109

Tel: +972 (9) 7692306
8272306
Email: yd...@redhat.com
IRC: ydary

On Thu, May 26, 2016 at 9:14 AM, Yaniv Dary wrote:
> What DR solution are you using?
>
> [signature and earlier quoted messages trimmed; original post quoted in full below]

___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users
Re: [ovirt-users] One RHEV Virtual Machine does not Automatically Resume following Compellent SAN Controller Failover
We see exactly the same, and it does not seem to be vendor dependent:

- EqualLogic controller failover -> VMs get paused and some resume, but most don't
- Nexenta ZFS iSCSI with RSF-1 HA -> same
- FreeBSD ctld iSCSI target + Heartbeat -> same
- CentOS + iSCSI target + Heartbeat -> same

Multipath settings are, where available, modified to match the best practice supplied by the vendor. On the open-source solutions we started with known-working multipath/iSCSI settings, and by now nearly every possible setting has been tested, without much success.

To me it looks like oVirt/RHEV is far too sensitive to iSCSI interruptions, and it feels like gambling what the engine might do to your VM (or not).

Am 11/23/2015 um 8:37 PM schrieb Duckworth, Douglas C:
> [original post quoted in full; trimmed]
Re: [ovirt-users] One RHEV Virtual Machine does not Automatically Resume following Compellent SAN Controller Failover
On Mon, Nov 23, 2015 at 9:37 PM, Duckworth, Douglas C wrote:
> [original post quoted in full; trimmed]
>
> Does anyone have an idea of why the VM would fail to automatically
> resume if the iSCSI paths used by its Storage Domain recovered?

Look at vdsm.log for the events which libvirt emits and the actions that vdsm takes on them. One of those actions should be to unpause the VM, AFAIR. If you don't see this, then QEMU/libvirt failed to propagate the state change, or it might be deeper down the stack. If the events are there, then share the vdsm logs.
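When correlating the pause with the path outage, it helps to pull a timeline of active path counts out of syslog. A small sketch that parses multipathd lines of the shape quoted in this thread ("<wwid>: remaining active paths: N") — the regex is inferred from the excerpt, so adjust it for your syslog prefix (hostname, daemon PID) as needed:

```python
import re

# Matches multipathd "remaining active paths" lines as quoted in this thread.
# Pattern inferred from the excerpt; real syslog lines may carry extra fields.
PATH_COUNT = re.compile(
    r"^(?P<ts>\w+ +\d+ \d+:\d+:\d+) .*multipathd.*?: "
    r"(?P<wwid>[0-9a-f]+): remaining active paths: (?P<n>\d+)"
)


def path_count_timeline(lines):
    """Return (timestamp, wwid, active_paths) tuples in log order."""
    events = []
    for line in lines:
        m = PATH_COUNT.search(line)
        if m:
            events.append((m.group("ts"), m.group("wwid"), int(m.group("n"))))
    return events


def outage_windows(events):
    """Yield (wwid, down_ts, up_ts) spans where a map hit 0 active paths."""
    down = {}
    for ts, wwid, n in events:
        if n == 0:
            down.setdefault(wwid, ts)
        elif wwid in down:
            yield wwid, down.pop(wwid), ts
```

Running this over the excerpt above would report one outage window for WWID 36000d310005caf000270, from 16:47:40 to 16:59:17 — the window in which any VM on that Storage Domain would have hit EIO.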
Re: [ovirt-users] One RHEV Virtual Machine does not Automatically Resume following Compellent SAN Controller Failover
What DR solution are you using?

Yaniv Dary
Technical Product Manager
Red Hat Israel Ltd.
34 Jerusalem Road
Building A, 4th floor
Ra'anana, Israel 4350109

Tel: +972 (9) 7692306
8272306
Email: yd...@redhat.com
IRC: ydary

On Wed, Nov 25, 2015 at 1:15 PM, Simone Tiraboschi wrote:
> Adding Nir who knows it far better than me.
>
> [original post quoted in full; trimmed]
Re: [ovirt-users] One RHEV Virtual Machine does not Automatically Resume following Compellent SAN Controller Failover
Adding Nir who knows it far better than me.

On Mon, Nov 23, 2015 at 8:37 PM, Duckworth, Douglas C wrote:
> [original post quoted in full; trimmed]
[ovirt-users] One RHEV Virtual Machine does not Automatically Resume following Compellent SAN Controller Failover
Hello --

Not sure if y'all can help with this issue we've been seeing with RHEV...

On 11/13/2015, during a code upgrade of the Compellent SAN at our Disaster Recovery site, we failed over to the secondary SAN controller. Most virtual machines in our DR cluster resumed automatically after pausing, except VM "BADVM" on host "BADHOST."

In engine.log you can see that BADVM was sent into the "VM_PAUSED_EIO" state at 10:47:57:

"VM BADVM has paused due to storage I/O problem."

On this Red Hat Enterprise Virtualization Hypervisor 6.6 (20150512.0.el6ev) host, two other VMs paused but then automatically resumed without system administrator intervention. In our DR cluster, 22 VMs also resumed automatically.

None of these guest VMs are engaged in high I/O, as these are DR-site VMs not currently doing anything.

We sent this information to Dell. Their response:

"The root cause may reside within your virtualization solution, not the parent OS (RHEV-Hypervisor disc) or Storage (Dell Compellent.)"

We are doing this failover again on Sunday, November 29th, so we would like to know how to mitigate this issue, given that we have to manually resume paused VMs that don't resume automatically.

Before we initiated the SAN controller failover, all iSCSI paths to targets were present on host tulhv2p03.

The VM's log on the host, /var/log/libvirt/qemu/badhost.log, shows that a storage error was reported:

block I/O error in device 'drive-virtio-disk0': Input/output error (5)
block I/O error in device 'drive-virtio-disk0': Input/output error (5)
block I/O error in device 'drive-virtio-disk0': Input/output error (5)
block I/O error in device 'drive-virtio-disk0': Input/output error (5)

All disks used by this guest VM are provided by a single Storage Domain, COM_3TB4_DR, with serial "270." In syslog we do see that all paths for that Storage Domain failed:

Nov 13 16:47:40 multipathd: 36000d310005caf000270: remaining active paths: 0

Though these recovered later:

Nov 13 16:59:17 multipathd: 36000d310005caf000270: sdbg - tur checker reports path is up
Nov 13 16:59:17 multipathd: 36000d310005caf000270: remaining active paths: 8

Does anyone have an idea of why the VM would fail to automatically resume if the iSCSI paths used by its Storage Domain recovered?

Thanks
Doug

--
Thanks

Douglas Charles Duckworth
Unix Administrator
Tulane University
Technology Services
1555 Poydras Ave
NOLA -- 70112

E: du...@tulane.edu
O: 504-988-9341
F: 504-988-8505
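Until the root cause is found, the manual-resume step described above can at least be scripted. This is a hypothetical sketch using the libvirt-python bindings, not anything RHEV ships: on a RHEV host vdsm normally owns this, and resuming guests behind vdsm's back should be tested carefully before a planned failover. The enum values are mirrored as plain ints so the decision logic is self-contained; they are assumed to match libvirt's virDomainState and virDomainPausedReason enums.

```python
# Resume VMs that are paused specifically because of an I/O error.
# Constants assumed to mirror libvirt's virDomainState / virDomainPausedReason.
VIR_DOMAIN_PAUSED = 3
VIR_DOMAIN_PAUSED_IOERROR = 5


def should_resume(state, reason):
    """Only touch guests paused for EIO; leave user/migration pauses alone."""
    return state == VIR_DOMAIN_PAUSED and reason == VIR_DOMAIN_PAUSED_IOERROR


def resume_eio_paused(uri="qemu:///system"):
    """Resume every EIO-paused domain on the host; return resumed names."""
    import libvirt  # deferred: the policy check above works without it

    conn = libvirt.open(uri)
    resumed = []
    try:
        for dom in conn.listAllDomains():
            state, reason = dom.state()
            if should_resume(state, reason):
                dom.resume()
                resumed.append(dom.name())
    finally:
        conn.close()
    return resumed
```

Run after the failover window closes and paths have recovered; anything still paused for a reason other than EIO (user pause, migration) is deliberately left alone.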