One other interesting fact is that each node has 4 NFS mountpoints: two (data & export) to the main SAN, one to the engine machine for ISO, and one to the legacy SAN.
When this issue occurs, the only mountpoints in a problem state seem to be the two to the main SAN:

2014-02-18 11:48:03,598 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain d88764c8-ecc3-4f22-967e-2ce225ac4498:Export in problem. vds: hv5
2014-02-18 11:48:18,909 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain e9f70496-f181-4c9b-9ecb-d7f780772b04:Data in problem. vds: hv5

On Tue, Feb 18, 2014 at 3:04 PM, Johan Kooijman <m...@johankooijman.com> wrote:

> Ok, will do. process_pool_max_slots_per_domain is not defined; the nodes use the default values.
>
> On Tue, Feb 18, 2014 at 2:56 PM, Meital Bourvine <mbour...@redhat.com> wrote:
>
>> Hi Johan,
>>
>> Can you please run something like this on the SPM node?
>>
>> while true; do echo `date; ps ax | grep -i remotefilehandler | wc -l` >> /tmp/handler_num.txt; sleep 1; done
>>
>> When it happens again, please stop the script and post here the maximum number and the time it happened.
>>
>> Also, please check whether "process_pool_max_slots_per_domain" is defined in /etc/vdsm/vdsm.conf, and if so, what its value is. (If it's not defined there, the default is 10.)
>>
>> Thanks!
>>
>> ------------------------------
>>
>> *From: *"Johan Kooijman" <m...@johankooijman.com>
>> *To: *"Meital Bourvine" <mbour...@redhat.com>
>> *Cc: *"users" <users@ovirt.org>
>> *Sent: *Tuesday, February 18, 2014 2:55:11 PM
>> *Subject: *Re: [Users] Nodes lose storage at random
>>
>> To follow up on this: the setup has only ~80 VMs active right now. The two bug reports are not in scope for this setup; the issues occur at random, even when there's no activity (no VMs being created or deleted), and there are only 4 directories in /rhev/data-center/mnt/.
>>
>> On Tue, Feb 18, 2014 at 1:51 PM, Johan Kooijman <m...@johankooijman.com> wrote:
>>
>>> Meital,
>>>
>>> I'm running the latest stable oVirt, 3.3.3, on CentOS 6.5.
>>> For my nodes I use the node ISO CentOS 6 "oVirt Node - 3.0.1 - 1.0.2.el6".
>>>
>>> I have no way of reproducing it just yet. I can confirm that it's happening on all nodes in the cluster, and every time a node goes offline, this error pops up.
>>>
>>> Could the fact that lockd & statd were not running on the NFS host cause this error? Is there a workaround available that we know of?
>>>
>>> On Tue, Feb 18, 2014 at 12:57 PM, Meital Bourvine <mbour...@redhat.com> wrote:
>>>
>>>> Hi Johan,
>>>>
>>>> Please take a look at this error (from vdsm.log):
>>>>
>>>> Thread-636938::DEBUG::2014-02-18 10:48:06,374::task::579::TaskManager.Task::(_updateState) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::moving from state init -> state preparing
>>>> Thread-636938::INFO::2014-02-18 10:48:06,375::logUtils::44::dispatcher::(wrapper) Run and protect: getVolumeSize(sdUUID='e9f70496-f181-4c9b-9ecb-d7f780772b04', spUUID='59980e09-b329-4254-b66e-790abd69e194', imgUUID='d50ecfbb-dc98-40cf-9b19-4bd402952aeb', volUUID='68fefe24-0346-4d0d-b377-ddd7be7be29c', options=None)
>>>> Thread-636938::ERROR::2014-02-18 10:48:06,376::task::850::TaskManager.Task::(_setError) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Unexpected error
>>>> Thread-636938::DEBUG::2014-02-18 10:48:06,415::task::869::TaskManager.Task::(_run) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Task._run: f4ce9a6e-0292-4071-9a24-a8d8fba7222b ('e9f70496-f181-4c9b-9ecb-d7f780772b04', '59980e09-b329-4254-b66e-790abd69e194', 'd50ecfbb-dc98-40cf-9b19-4bd402952aeb', '68fefe24-0346-4d0d-b377-ddd7be7be29c') {} failed - stopping task
>>>> Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::1194::TaskManager.Task::(stop) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::stopping in state preparing (force False)
>>>> Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::974::TaskManager.Task::(_decref)
Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::ref 1 aborting True
>>>> Thread-636938::INFO::2014-02-18 10:48:06,416::task::1151::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::aborting: Task is aborted: u'No free file handlers in pool' - code 100
>>>> Thread-636938::DEBUG::2014-02-18 10:48:06,417::task::1156::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Prepare: aborted: No free file handlers in pool
>>>>
>>>> And then you can see after a few seconds:
>>>>
>>>> MainThread::INFO::2014-02-18 10:48:45,258::vdsm::101::vds::(run) (PID: 1450) I am the actual vdsm 4.12.1-2.el6 hv5.ovirt.gs.cloud.lan (2.6.32-358.18.1.el6.x86_64)
>>>>
>>>> Meaning that vdsm was restarted.
>>>>
>>>> Which oVirt version are you using? I see a few old bugs that describe the same behaviour, but with different reproduction steps, for example [1], [2].
>>>>
>>>> Can you think of any reproduction steps that might be causing this issue?
>>>>
>>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=948210
>>>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=853011
>>>>
>>>> ------------------------------
>>>>
>>>> *From: *"Johan Kooijman" <m...@johankooijman.com>
>>>> *To: *"users" <users@ovirt.org>
>>>> *Sent: *Tuesday, February 18, 2014 1:32:56 PM
>>>> *Subject: *[Users] Nodes lose storage at random
>>>>
>>>> Hi All,
>>>>
>>>> We're seeing some weird issues in our oVirt setup. We have 4 nodes connected and an NFS (v3) filestore (FreeBSD/ZFS).
>>>>
>>>> Once in a while, seemingly at random, a node loses its connection to storage and recovers it a minute later. The other nodes usually don't lose their storage at that moment. Just one, or two at a time.
>>>>
>>>> We've set up extra tooling to verify the storage performance at those moments and the availability for other systems.
It's always online; the nodes just don't think so.
>>>>
>>>> The engine tells me this:
>>>>
>>>> 2014-02-18 11:48:03,598 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain d88764c8-ecc3-4f22-967e-2ce225ac4498:Export in problem. vds: hv5
>>>> 2014-02-18 11:48:18,909 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain e9f70496-f181-4c9b-9ecb-d7f780772b04:Data in problem. vds: hv5
>>>> 2014-02-18 11:48:45,021 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-18) [46683672] Failed to refresh VDS , vds = 66e6aace-e51d-4006-bb2f-d85c2f1fd8d2 : hv5, VDS Network Error, continuing.
>>>> 2014-02-18 11:48:45,070 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-41) [2ef1a894] Correlation ID: 2ef1a894, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center GS. Setting Data Center status to Non Responsive (On host hv5, Error: Network error during communication with the Host.).
>>>>
>>>> The export and data domains live over NFS. There's another domain, ISO, that lives on the engine machine, also shared over NFS. That domain doesn't have any issues at all.
>>>>
>>>> Attached are the logfiles for the relevant time period, for both the engine server and the node. The node, by the way, is a deployment of the node ISO, not a full-blown installation.
>>>>
>>>> Any clues on where to begin searching? The NFS server shows no issues, nor anything in its logs. I did notice that the statd and lockd daemons were not running, but I wonder if that can have anything to do with the issue.
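[Editorial aside on the statd/lockd question above: whether the NFS server's lock manager (nlockmgr) and status monitor (status, i.e. statd) are registered with its portmapper can be probed from any node with rpcinfo. A minimal sketch; the hostname and report path are placeholders, not values from this thread:]

```shell
#!/bin/sh
# Probe the NFS server's portmapper for the two services NFSv3 locking needs.
# "nfs-server.example.com" is a placeholder hostname; substitute the real
# storage host. Results are written to a small report file and echoed.
NFS_HOST="nfs-server.example.com"
REPORT=/tmp/nfs_lock_check.txt

: > "$REPORT"
for svc in nlockmgr status; do
    if rpcinfo -p "$NFS_HOST" 2>/dev/null | grep -q "$svc"; then
        echo "$svc: registered" >> "$REPORT"
    else
        echo "$svc: NOT registered" >> "$REPORT"
    fi
done
cat "$REPORT"
```

[If nlockmgr or status is not registered, NFSv3 lock calls from the clients can fail or hang, so this is worth ruling out independently of the vdsm handler-pool errors.]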
>>>>
>>>> --
>>>> Met vriendelijke groeten / With kind regards,
>>>> Johan Kooijman
>>>>
>>>> m...@johankooijman.com
>>>>
>>>> _______________________________________________
>>>> Users mailing list
>>>> Users@ovirt.org
>>>> http://lists.ovirt.org/mailman/listinfo/users

--
Met vriendelijke groeten / With kind regards,
Johan Kooijman

T +31(0) 6 43 44 45 27
F +31(0) 162 82 00 01
E m...@johankooijman.com
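[Editorial note: the sampling one-liner suggested earlier in the thread can be written as a slightly more robust script. This is a sketch, not tested against this setup; the output path matches the one used in the thread, and the loop is bounded here for illustration where the original runs until stopped:]

```shell
#!/bin/sh
# Record a timestamped count of vdsm remotefilehandler processes once per
# second, so the peak around an outage can be found afterwards.
OUT=/tmp/handler_num.txt
: > "$OUT"                # start fresh; drop this line to append instead
for i in 1 2 3; do        # the original uses `while true`; bounded here
    count=$(ps ax | grep -i remotefilehandler | grep -v grep | wc -l)
    echo "$(date '+%F %T') $count" >> "$OUT"
    sleep 1
done
# Peak sample (the count is the 3rd field of each line):
sort -n -k3 "$OUT" | tail -1
```

[The `grep -v grep` keeps the sampling pipeline's own grep process out of the count, which the original one-liner did not do; with the default process_pool_max_slots_per_domain of 10 mentioned in the thread, a sustained count near the pool limit would line up with the "No free file handlers in pool" aborts in vdsm.log.]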