Re: [Users] Storage unresponsive after sanlock

2014-01-29 Thread Maor Lipchuk
The VDSM log seems to be from the 26th, while from the engine logs it seems
that the incident occurred on the 24th, so I can't really see what
happened in VDSM at that time.

From the engine logs it seems that at around 2014-01-24 16:59 the master
storage domain was in maintenance and then there was an attempt to
activate it, but VDSM threw an exception that it cannot find the master
domain with the arguments
spUUID=5849b030-626e-47cb-ad90-3ce782d831b3,
msdUUID=7c49750d-7eae-4cd2-9b63-1dc71f357b88

This could happen for various reasons, for example a failure connecting to
the storage (see https://bugzilla.redhat.com/782864 for one such case).
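
If it helps narrow it down, you could also check what the host itself sees
for that domain.  A rough sketch (assuming vdsClient is available on the
host and the -s flag matches your SSL setup):

  # ask VDSM directly whether it can see the master domain
  vdsClient -s 0 getStorageDomainInfo 7c49750d-7eae-4cd2-9b63-1dc71f357b88
  # list every storage domain the host currently sees
  vdsClient -s 0 getStorageDomainsList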

Since you mentioned that once you added a second node it worked, it seems
like the origin of the problem is in the host itself.

What are the differences between the two hosts (VDSM version, OS version)?
Does the first host succeed in working with another DC?
Have you tried to reinstall it?

Regards,
Maor




On 01/29/2014 02:50 AM, Trey Dockendorf wrote:
 See attached.  The event seems to have begun around 06:00:00 on
 2014-01-26.  I was unable to get the single node cluster back online
 so I provisioned another node to add to the cluster, which became the
 SPM.  Adding the second node worked and I had to power cycle the node
 that hung as sanlock was in a zombie state.  This is my first attempt
 at production use of NFS over RDMA and I'd like to rule out that being
 the cause.  Since the issue I've changed the 'nfs_mount_options' in
 /etc/vdsm/vdsm.conf to 'soft,nosharecache,rdma,port=20049'.  The
 options during the crash were only 'rdma,port=20049'.  I am also
 forcing NFSv3 by setting 'Nfsvers=3' in /etc/nfsmount.conf, which is
 still in place and was in place during the crash.
 
 Thanks
 - Trey
 
 On Tue, Jan 28, 2014 at 2:45 AM, Maor Lipchuk mlipc...@redhat.com wrote:
 Hi Trey,

 Can you please also attach the engine/vdsm logs.

 Thanks,
 Maor

 On 01/27/2014 06:12 PM, Trey Dockendorf wrote:
 I set up my first oVirt instance since 3.0 a few days ago and it went
 very well, and I left the single-host cluster running with 1 VM over
 the weekend.  Today I came back and the primary data storage is marked
 as unresponsive.  The logs are full of entries [1] that look very
 similar to a knowledge base article on RHEL's website [2].

 This setup is using NFS over RDMA and so far the ib interfaces report
 no errors (via `ibcheckerrs -v LID 1`).  Based on a doc on the oVirt
 site [3], it seems this could be due to storage response problems.  The storage
 system is a new purchase and not yet in production so if there's any
 advice on how to track down the cause that would be very helpful.
 Please let me know what additional information would be helpful as
 it's been about a year since I've been active in the oVirt community.

 Thanks
 - Trey

 [1]: http://pastebin.com/yRpSLKxJ

 [2]: https://access.redhat.com/site/solutions/400463

 [3]: http://www.ovirt.org/SANLock
 ___
 Users mailing list
 Users@ovirt.org
 http://lists.ovirt.org/mailman/listinfo/users


 

___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [Users] Storage unresponsive after sanlock

2014-01-29 Thread Trey Dockendorf
On Wed, Jan 29, 2014 at 4:33 AM, Maor Lipchuk mlipc...@redhat.com wrote:
 The VDSM log seems to be from the 26th, while from the engine logs it seems
 that the incident occurred on the 24th, so I can't really see what
 happened in VDSM at that time.

 From the engine logs it seems that at around 2014-01-24 16:59 the master
 storage domain was in maintenance and then there was an attempt to
 activate it, but VDSM threw an exception that it cannot find the master
 domain with the arguments
 spUUID=5849b030-626e-47cb-ad90-3ce782d831b3,
 msdUUID=7c49750d-7eae-4cd2-9b63-1dc71f357b88

 This could happen for various reasons, for example a failure connecting to
 the storage (see https://bugzilla.redhat.com/782864 for one such case).


Some errors on my part that occurred before the sanlock issue were
having all the NFS exports use the same fsid, as well as initially
failing to pass custom NFS options to VDSM correctly.  The sanlock
issue was not present as late as 18:00 on 2014-01-24, as I was still
working in the web interface at that time and saw no issues.
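
For reference, the corrected pieces now look roughly like this (the export
paths are illustrative, not my real ones, and I believe nfs_mount_options
belongs under the [irs] section of vdsm.conf):

  # /etc/exports -- every export now gets its own fsid
  /export/data  *(rw,no_subtree_check,fsid=1)
  /export/iso   *(rw,no_subtree_check,fsid=2)

  # /etc/vdsm/vdsm.conf
  [irs]
  nfs_mount_options = soft,nosharecache,rdma,port=20049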

 Since you mentioned that once you added a second node it worked, it seems
 like the origin of the problem is in the host itself.

 What are the differences between the two hosts (VDSM version, OS version)?

There should be no differences.  They are identical hardware, and both
are provisioned and configured using Puppet.

* vdsm-4.13.3-2.el6.x86_64
* OS is CentOS 6.5 - 2.6.32-431.3.1.el6.x86_64

 Does the first host succeed in working with another DC?

I only have the default DC defined.  Would it be worth setting up
another DC for the sake of troubleshooting?

 Have you tried to reinstall it?

Not yet.  The install process is automated, as is the configuration,
so whatever issues I'm running into SHOULD still be present after a
re-install.  If there is a possibility a fresh install could somehow
fix this, I can re-provision.

I just noticed that the 2nd host (vm02) added to the default cluster has
become Non Operational, and the VM on that host failed to migrate to
the 1st host (vm01), which became SPM and is marked as Up.  The logs
on vm02 are full of sanlock messages.  What concerns me is that the VM I
have running for testing is non-responsive, and vm01 shows messages
such as "Time out during operation: cannot acquire state change lock".
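
In case it helps, this is roughly what I'm looking at on vm02 while it's in
this state (standard commands, nothing assumed beyond the default log
locations):

  # sanlock's own view of its lockspaces and held resources
  sanlock client status
  # recent sanlock/wdmd messages in syslog
  grep -E 'sanlock|wdmd' /var/log/messages | tail -n 50
  # the qemu/libvirt side of the 'cannot acquire state change lock' error
  tail -n 50 /var/log/libvirt/qemu/*.log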

I can't yet pinpoint when the failure occurred, and to avoid sending 3
days' worth of logs from 3 hosts, I'll reset everything and try to
reproduce this with some monitoring in place to get an approximate
timestamp for the failure.

Thanks
- Trey


 Regards,
 Maor




 On 01/29/2014 02:50 AM, Trey Dockendorf wrote:
 See attached.  The event seems to have begun around 06:00:00 on
 2014-01-26.  I was unable to get the single node cluster back online
 so I provisioned another node to add to the cluster, which became the
 SPM.  Adding the second node worked and I had to power cycle the node
 that hung as sanlock was in a zombie state.  This is my first attempt
 at production use of NFS over RDMA and I'd like to rule out that being
 the cause.  Since the issue I've changed the 'nfs_mount_options' in
 /etc/vdsm/vdsm.conf to 'soft,nosharecache,rdma,port=20049'.  The
 options during the crash were only 'rdma,port=20049'.  I am also
 forcing NFSv3 by setting 'Nfsvers=3' in /etc/nfsmount.conf, which is
 still in place and was in place during the crash.

 Thanks
 - Trey

 On Tue, Jan 28, 2014 at 2:45 AM, Maor Lipchuk mlipc...@redhat.com wrote:
 Hi Trey,

 Can you please also attach the engine/vdsm logs.

 Thanks,
 Maor

 On 01/27/2014 06:12 PM, Trey Dockendorf wrote:
 I set up my first oVirt instance since 3.0 a few days ago and it went
 very well, and I left the single-host cluster running with 1 VM over
 the weekend.  Today I came back and the primary data storage is marked
 as unresponsive.  The logs are full of entries [1] that look very
 similar to a knowledge base article on RHEL's website [2].

 This setup is using NFS over RDMA and so far the ib interfaces report
 no errors (via `ibcheckerrs -v LID 1`).  Based on a doc on the oVirt
 site [3], it seems this could be due to storage response problems.  The storage
 system is a new purchase and not yet in production so if there's any
 advice on how to track down the cause that would be very helpful.
 Please let me know what additional information would be helpful as
 it's been about a year since I've been active in the oVirt community.

 Thanks
 - Trey

 [1]: http://pastebin.com/yRpSLKxJ

 [2]: https://access.redhat.com/site/solutions/400463

 [3]: http://www.ovirt.org/SANLock
 ___
 Users mailing list
 Users@ovirt.org
 http://lists.ovirt.org/mailman/listinfo/users




___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [Users] Storage unresponsive after sanlock

2014-01-29 Thread Trey Dockendorf
On Wed, Jan 29, 2014 at 4:33 AM, Maor Lipchuk mlipc...@redhat.com wrote:
 The VDSM log seems to be from the 26th, while from the engine logs it seems
 that the incident occurred on the 24th, so I can't really see what
 happened in VDSM at that time.

 From the engine logs it seems that at around 2014-01-24 16:59 the master
 storage domain was in maintenance and then there was an attempt to
 activate it, but VDSM threw an exception that it cannot find the master
 domain with the arguments
 spUUID=5849b030-626e-47cb-ad90-3ce782d831b3,
 msdUUID=7c49750d-7eae-4cd2-9b63-1dc71f357b88


The actual error was higher up in the logs, after I tried activating this
host.  Puppet had removed the unmanaged /etc/sudoers.d/50_vdsm file, which
was preventing VDSM from executing any mount commands.  The issues with
vm02 are likely all due to that mistake on my part.  My apologies.
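
For anyone who hits the same thing: as far as I can tell the file is shipped
by the vdsm package, so it's easy to verify and restore, and the Puppet-side
fix is simply to stop purging unmanaged files under /etc/sudoers.d.  A rough
sketch of the recovery on these EL6 hosts:

  # rpm -V should report the sudoers fragment as missing if it was removed
  rpm -V vdsm | grep sudoers
  # reinstalling the package should put /etc/sudoers.d/50_vdsm back
  yum reinstall -y vdsm
  # restart vdsmd so it comes back up with sudo working again
  service vdsmd restart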

- Trey

 This could happen for various reasons, for example a failure connecting to
 the storage (see https://bugzilla.redhat.com/782864 for one such case).

 Since you mentioned that once you added a second node it worked, it seems
 like the origin of the problem is in the host itself.

 What are the differences between the two hosts (VDSM version, OS version)?
 Does the first host succeed in working with another DC?
 Have you tried to reinstall it?

 Regards,
 Maor




 On 01/29/2014 02:50 AM, Trey Dockendorf wrote:
 See attached.  The event seems to have begun around 06:00:00 on
 2014-01-26.  I was unable to get the single node cluster back online
 so I provisioned another node to add to the cluster, which became the
 SPM.  Adding the second node worked and I had to power cycle the node
 that hung as sanlock was in a zombie state.  This is my first attempt
 at production use of NFS over RDMA and I'd like to rule out that being
 the cause.  Since the issue I've changed the 'nfs_mount_options' in
 /etc/vdsm/vdsm.conf to 'soft,nosharecache,rdma,port=20049'.  The
 options during the crash were only 'rdma,port=20049'.  I am also
 forcing NFSv3 by setting 'Nfsvers=3' in /etc/nfsmount.conf, which is
 still in place and was in place during the crash.

 Thanks
 - Trey

 On Tue, Jan 28, 2014 at 2:45 AM, Maor Lipchuk mlipc...@redhat.com wrote:
 Hi Trey,

 Can you please also attach the engine/vdsm logs.

 Thanks,
 Maor

 On 01/27/2014 06:12 PM, Trey Dockendorf wrote:
 I set up my first oVirt instance since 3.0 a few days ago and it went
 very well, and I left the single-host cluster running with 1 VM over
 the weekend.  Today I came back and the primary data storage is marked
 as unresponsive.  The logs are full of entries [1] that look very
 similar to a knowledge base article on RHEL's website [2].

 This setup is using NFS over RDMA and so far the ib interfaces report
 no errors (via `ibcheckerrs -v LID 1`).  Based on a doc on the oVirt
 site [3], it seems this could be due to storage response problems.  The storage
 system is a new purchase and not yet in production so if there's any
 advice on how to track down the cause that would be very helpful.
 Please let me know what additional information would be helpful as
 it's been about a year since I've been active in the oVirt community.

 Thanks
 - Trey

 [1]: http://pastebin.com/yRpSLKxJ

 [2]: https://access.redhat.com/site/solutions/400463

 [3]: http://www.ovirt.org/SANLock
 ___
 Users mailing list
 Users@ovirt.org
 http://lists.ovirt.org/mailman/listinfo/users




___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [Users] Storage unresponsive after sanlock

2014-01-28 Thread Maor Lipchuk
Hi Trey,

Can you please also attach the engine/vdsm logs.

Thanks,
Maor

On 01/27/2014 06:12 PM, Trey Dockendorf wrote:
 I set up my first oVirt instance since 3.0 a few days ago and it went
 very well, and I left the single-host cluster running with 1 VM over
 the weekend.  Today I came back and the primary data storage is marked
 as unresponsive.  The logs are full of entries [1] that look very
 similar to a knowledge base article on RHEL's website [2].
 
 This setup is using NFS over RDMA and so far the ib interfaces report
 no errors (via `ibcheckerrs -v LID 1`).  Based on a doc on the oVirt
 site [3], it seems this could be due to storage response problems.  The storage
 system is a new purchase and not yet in production so if there's any
 advice on how to track down the cause that would be very helpful.
 Please let me know what additional information would be helpful as
 it's been about a year since I've been active in the oVirt community.
 
 Thanks
 - Trey
 
 [1]: http://pastebin.com/yRpSLKxJ
 
 [2]: https://access.redhat.com/site/solutions/400463
 
 [3]: http://www.ovirt.org/SANLock
 ___
 Users mailing list
 Users@ovirt.org
 http://lists.ovirt.org/mailman/listinfo/users
 

___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


[Users] Storage unresponsive after sanlock

2014-01-27 Thread Trey Dockendorf
I set up my first oVirt instance since 3.0 a few days ago and it went
very well, and I left the single-host cluster running with 1 VM over
the weekend.  Today I came back and the primary data storage is marked
as unresponsive.  The logs are full of entries [1] that look very
similar to a knowledge base article on RHEL's website [2].

This setup is using NFS over RDMA and so far the ib interfaces report
no errors (via `ibcheckerrs -v LID 1`).  Based on a doc on the oVirt
site [3], it seems this could be due to storage response problems.  The storage
system is a new purchase and not yet in production so if there's any
advice on how to track down the cause that would be very helpful.
Please let me know what additional information would be helpful as
it's been about a year since I've been active in the oVirt community.
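
For what it's worth, here's roughly how I'm checking the fabric and the
mount from the host (the LID/port arguments to ibcheckerrs are specific to
my setup):

  # InfiniBand link state and error counters (from infiniband-diags)
  ibstat
  perfquery
  # confirm the data domain really is mounted with the RDMA transport
  nfsstat -m
  mount | grep -i rdma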

Thanks
- Trey

[1]: http://pastebin.com/yRpSLKxJ

[2]: https://access.redhat.com/site/solutions/400463

[3]: http://www.ovirt.org/SANLock
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users