One other interesting fact is that each node has 4 NFS mountpoints: two (data & export) to the main SAN, one to the engine machine for ISO, and one to the legacy SAN.
When this issue occurs, the only mountpoints in a problem state seem to be the two to the main SAN:

2014-02-18 11:48:03,598 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain d88764c8-ecc3-4f22-967e-2ce225ac4498:Export in problem. vds: hv5
2014-02-18 11:48:18,909 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain e9f70496-f181-4c9b-9ecb-d7f780772b04:Data in problem. vds: hv5

On Tue, Feb 18, 2014 at 3:04 PM, Johan Kooijman <m...@johankooijman.com> wrote:

> Ok, will do. process_pool_max_slots_per_domain is not defined; the nodes use the default values.
>
> On Tue, Feb 18, 2014 at 2:56 PM, Meital Bourvine <mbour...@redhat.com> wrote:
>
>> Hi Johan,
>>
>> Can you please run something like this on the SPM node?
>>
>> while true; do echo `date; ps ax | grep -i remotefilehandler | wc -l` >> /tmp/handler_num.txt; sleep 1; done
>>
>> When it happens again, please stop the script and post here the maximum number and the time it happened.
>>
>> Also, please check whether "process_pool_max_slots_per_domain" is defined in /etc/vdsm/vdsm.conf, and if so, what its value is. (If it's not defined there, the default is 10.)
>>
>> Thanks!
>>
>> ------------------------------
>>
>> *From: *"Johan Kooijman" <m...@johankooijman.com>
>> *To: *"Meital Bourvine" <mbour...@redhat.com>
>> *Cc: *"users" <users@ovirt.org>
>> *Sent: *Tuesday, February 18, 2014 2:55:11 PM
>> *Subject: *Re: [Users] Nodes lose storage at random
>>
>> To follow up on this: the setup has only ~80 VMs active right now. The two bug reports are not in scope for this setup; the issues occur at random, even when there's no activity (no VMs being created or deleted), and there are only 4 directories in /rhev/data-center/mnt/.
>>
>> On Tue, Feb 18, 2014 at 1:51 PM, Johan Kooijman <m...@johankooijman.com> wrote:
>>
>>> Meital,
>>>
>>> I'm running the latest stable oVirt, 3.3.3, on CentOS 6.5.
>>> For my nodes I use the node ISO CentOS 6 "oVirt Node - 3.0.1 - 1.0.2.el6".
>>>
>>> I have no way of reproducing it just yet. I can confirm that it's happening on all nodes in the cluster, and every time a node goes offline, this error pops up.
>>>
>>> Could the fact that lockd & statd were not running on the NFS host cause this error? Is there a workaround available that we know of?
>>>
>>> On Tue, Feb 18, 2014 at 12:57 PM, Meital Bourvine <mbour...@redhat.com> wrote:
>>>
>>>> Hi Johan,
>>>>
>>>> Please take a look at this error (from vdsm.log):
>>>>
>>>> Thread-636938::DEBUG::2014-02-18 10:48:06,374::task::579::TaskManager.Task::(_updateState) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::moving from state init -> state preparing
>>>> Thread-636938::INFO::2014-02-18 10:48:06,375::logUtils::44::dispatcher::(wrapper) Run and protect: getVolumeSize(sdUUID='e9f70496-f181-4c9b-9ecb-d7f780772b04', spUUID='59980e09-b329-4254-b66e-790abd69e194', imgUUID='d50ecfbb-dc98-40cf-9b19-4bd402952aeb', volUUID='68fefe24-0346-4d0d-b377-ddd7be7be29c', options=None)
>>>> Thread-636938::ERROR::2014-02-18 10:48:06,376::task::850::TaskManager.Task::(_setError) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Unexpected error
>>>> Thread-636938::DEBUG::2014-02-18 10:48:06,415::task::869::TaskManager.Task::(_run) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Task._run: f4ce9a6e-0292-4071-9a24-a8d8fba7222b ('e9f70496-f181-4c9b-9ecb-d7f780772b04', '59980e09-b329-4254-b66e-790abd69e194', 'd50ecfbb-dc98-40cf-9b19-4bd402952aeb', '68fefe24-0346-4d0d-b377-ddd7be7be29c') {} failed - stopping task
>>>> Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::1194::TaskManager.Task::(stop) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::stopping in state preparing (force False)
>>>> Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::974::TaskManager.Task::(_decref)
Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::ref 1 aborting True
>>>> Thread-636938::INFO::2014-02-18 10:48:06,416::task::1151::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::aborting: Task is aborted: u'No free file handlers in pool' - code 100
>>>> Thread-636938::DEBUG::2014-02-18 10:48:06,417::task::1156::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Prepare: aborted: No free file handlers in pool
>>>>
>>>> And then you can see after a few seconds:
>>>>
>>>> MainThread::INFO::2014-02-18 10:48:45,258::vdsm::101::vds::(run) (PID: 1450) I am the actual vdsm 4.12.1-2.el6 hv5.ovirt.gs.cloud.lan (2.6.32-358.18.1.el6.x86_64)
>>>>
>>>> Meaning that vdsm was restarted.
>>>>
>>>> Which oVirt version are you using? I see a few old bugs that describe the same behaviour, but with different reproduction steps, for example [1], [2].
>>>>
>>>> Can you think of any reproduction steps that might be causing this issue?
>>>>
>>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=948210
>>>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=853011
>>>>
>>>> ------------------------------
>>>>
>>>> *From: *"Johan Kooijman" <m...@johankooijman.com>
>>>> *To: *"users" <users@ovirt.org>
>>>> *Sent: *Tuesday, February 18, 2014 1:32:56 PM
>>>> *Subject: *[Users] Nodes lose storage at random
>>>>
>>>> Hi All,
>>>>
>>>> We're seeing some weird issues in our oVirt setup. We have 4 nodes connected and an NFS (v3) filestore (FreeBSD/ZFS).
>>>>
>>>> Once in a while, seemingly at random, a node loses its connection to storage and recovers it a minute later. The other nodes usually don't lose their storage at that moment. Just one, or two at a time.
>>>>
>>>> We've set up extra tooling to verify the storage performance at those moments and the availability for other systems.
It's always online; the nodes just don't think so.
>>>>
>>>> The engine tells me this:
>>>>
>>>> 2014-02-18 11:48:03,598 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain d88764c8-ecc3-4f22-967e-2ce225ac4498:Export in problem. vds: hv5
>>>> 2014-02-18 11:48:18,909 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain e9f70496-f181-4c9b-9ecb-d7f780772b04:Data in problem. vds: hv5
>>>> 2014-02-18 11:48:45,021 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-18) [46683672] Failed to refresh VDS , vds = 66e6aace-e51d-4006-bb2f-d85c2f1fd8d2 : hv5, VDS Network Error, continuing.
>>>> 2014-02-18 11:48:45,070 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-41) [2ef1a894] Correlation ID: 2ef1a894, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center GS. Setting Data Center status to Non Responsive (On host hv5, Error: Network error during communication with the Host.).
>>>>
>>>> The export and data domains live over NFS. There's another domain, ISO, that lives on the engine machine, also shared over NFS. That domain doesn't have any issues at all.
>>>>
>>>> Attached are the logfiles for the relevant time period, for both the engine server and the node. The node, by the way, is a deployment of the node ISO, not a full-blown installation.
>>>>
>>>> Any clues on where to begin searching? The NFS server shows no issues, nor anything in its logs. I did notice that the statd and lockd daemons were not running, but I wonder if that can have anything to do with the issue.
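[Editorial aside on the statd/lockd question above: whether the NFS server's lock manager (nlockmgr) and status monitor (status, i.e. statd) are registered with its portmapper can be probed from any node with rpcinfo. A minimal sketch; the hostname and report path are placeholders, not values from this thread:]

```shell
#!/bin/sh
# Probe the NFS server's portmapper for the two services NFSv3 locking needs.
# "nfs-server.example.com" is a placeholder hostname; substitute the real
# storage host. Results are written to a small report file and echoed.
NFS_HOST="nfs-server.example.com"
REPORT=/tmp/nfs_lock_check.txt

: > "$REPORT"
for svc in nlockmgr status; do
    if rpcinfo -p "$NFS_HOST" 2>/dev/null | grep -q "$svc"; then
        echo "$svc: registered" >> "$REPORT"
    else
        echo "$svc: NOT registered" >> "$REPORT"
    fi
done
cat "$REPORT"
```

[If nlockmgr or status is not registered, NFSv3 lock calls from the clients can fail or hang, so this is worth ruling out independently of the vdsm handler-pool errors.]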
>>>>
>>>> --
>>>> Met vriendelijke groeten / With kind regards,
>>>> Johan Kooijman
>>>>
>>>> m...@johankooijman.com
>>>>
>>>> _______________________________________________
>>>> Users mailing list
>>>> Users@ovirt.org
>>>> http://lists.ovirt.org/mailman/listinfo/users

--
Met vriendelijke groeten / With kind regards,
Johan Kooijman

T +31(0) 6 43 44 45 27
F +31(0) 162 82 00 01
E m...@johankooijman.com
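[Editorial note: the sampling one-liner suggested earlier in the thread can be written as a slightly more robust script. This is a sketch, not tested against this setup; the output path matches the one used in the thread, and the loop is bounded here for illustration where the original runs until stopped:]

```shell
#!/bin/sh
# Record a timestamped count of vdsm remotefilehandler processes once per
# second, so the peak around an outage can be found afterwards.
OUT=/tmp/handler_num.txt
: > "$OUT"                # start fresh; drop this line to append instead
for i in 1 2 3; do        # the original uses `while true`; bounded here
    count=$(ps ax | grep -i remotefilehandler | grep -v grep | wc -l)
    echo "$(date '+%F %T') $count" >> "$OUT"
    sleep 1
done
# Peak sample (the count is the 3rd field of each line):
sort -n -k3 "$OUT" | tail -1
```

[The `grep -v grep` keeps the sampling pipeline's own grep process out of the count, which the original one-liner did not do; with the default process_pool_max_slots_per_domain of 10 mentioned in the thread, a sustained count near the pool limit would line up with the "No free file handlers in pool" aborts in vdsm.log.]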