My best guess is the gres.conf file did not exist when the slurmd started. That should produce an error like this in the SlurmdLogFile:

slurmd: error: can't stat gres.conf file <your_path_here>/gres.conf, assuming zero resource counts

If a device file named by the File parameter in gres.conf does not exist, the slurmd will log a fatal error and exit.

Moe

Quoting Alfonso Pardo <[email protected]>:

> Yes, the devices are created on the nodes:
>
> #> /etc/init.d/nvidia-smi status
> Compute mode is already set to EXCLUSIVE_THREAD for GPU 0000:0C:00.0.
> Compute mode is already set to EXCLUSIVE_THREAD for GPU 0000:0A:00.0.
>
> The cluster works fine, but sometimes the nodes fail with "down
> state" and I get the gres/gpu error.
>
> On 05/10/12 13:05, James Sharpe wrote:
>> Do the device nodes actually exist on the nodes? You may need to
>> run nvidia-smi to create them.
>>
>> On 5 October 2012 11:31, Alfonso Pardo <[email protected]> wrote:
>>
>> Yes, I have defined the gres.conf with:
>>
>> ##gres.conf
>> Name=gpu File=/dev/nvidia[0-1]
>>
>> I have two Nvidia devices per node.
>>
>> On 05/10/12 11:55, [email protected] wrote:
>>> See the error. Read "man gres.conf". Is "File" defined?
>>>
>>> Alfonso Pardo <[email protected]> wrote:
>>>
>>> With DebugFlags=gres enabled, I got the following error:
>>>
>>> [2012-10-05T08:22:44] error: gres_plugin_node_config_unpack: gres/gpu lacks File parameter for node bc-p10-01
>>> [2012-10-05T08:22:44] gres/gpu: state for bc-p10-01
>>> [2012-10-05T08:22:44] error: Setting node bc-p10-01 state to DOWN
>>> [2012-10-05T08:22:44] debug2: inserting bc-p10-01(cluster) with 8 cpus
>>> [2012-10-05T08:22:44] error: _slurm_rpc_node_registration node=bc-p10-01: Invalid argument
>>>
>>> On 05/10/12 08:20, Alfonso Pardo wrote:
>>>> Thanks!
>>>>
>>>> I will set DebugFlags to "gres" and watch the logs.
>>>>
>>>> On 04/10/12 18:00, Moe Jette wrote:
>>>>> All that I can think of is the slurmd daemon was unable to read the
>>>>> gres.conf file when starting. You could add "DebugFlags=gres" to
>>>>> slurm.conf for more information about gres.
>>>>>
>>>>> Quoting Alfonso Pardo <[email protected]>:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I have a cluster with GPU resources. The cluster works correctly,
>>>>>> but sometimes nodes go down showing the following error: "gres/gpu
>>>>>> count too low"
>>>>>>
>>>>>> NodeName=bc-p10-01 Arch=x86_64 CoresPerSocket=4
>>>>>>    CPUAlloc=0 CPUErr=0 CPUTot=8 Features=(null)
>>>>>>    Gres=gpu:2
>>>>>>    NodeAddr=bc-p10-01 NodeHostName=bc-p10-01
>>>>>>    OS=Linux RealMemory=1 Sockets=2
>>>>>>    State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1
>>>>>>    BootTime=2012-07-30T12:25:31
>>>>>>    SlurmdStartTime=2012-07-31T08:16:03
>>>>>>    Reason=gres/gpu count too low
>>>>>>
>>>>>> Any suggestions?
>>>>>>
>>>>>> --
>>>>>> Alfonso Pardo Díaz
>>>>>> Researcher / System Administrator at CETA-Ciemat
>>>>>> c/ Sola nº 1; 10200 Trujillo, ESPAÑA
>>>>>> Tel: +34 927 65 93 17 Fax: +34 927 32 32 37
>>>>>> http://www.ceta-ciemat.es/
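
P.S. The two failure modes discussed in this thread (gres.conf itself missing, or a device file named by File= missing) can be checked on a node before slurmd is started. What follows is only a rough sketch in Python, not anything shipped with SLURM: the function names are made up for illustration, and it only understands a single numeric range such as /dev/nvidia[0-1], not the full range syntax that gres.conf may accept.

```python
import os
import re


def expand_bracket_range(path):
    """Expand one numeric range like /dev/nvidia[0-1] into individual paths.

    Assumption: only a single "[N-M]" range is handled; comma lists or
    multiple ranges are returned unexpanded or only partly expanded.
    """
    match = re.search(r"\[(\d+)-(\d+)\]", path)
    if match is None:
        return [path]
    low, high = int(match.group(1)), int(match.group(2))
    prefix, suffix = path[:match.start()], path[match.end():]
    return [prefix + str(i) + suffix for i in range(low, high + 1)]


def device_files(gres_conf_text):
    """Collect the File= paths from gres.conf text, ignoring comments."""
    paths = []
    for line in gres_conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        for token in line.split():
            key, _, value = token.partition("=")
            if key.lower() == "file" and value:
                paths.extend(expand_bracket_range(value))
    return paths


def check_gres(conf_path):
    """Return a list of problems found; an empty list means none were found."""
    if not os.path.exists(conf_path):
        return ["gres.conf not found: " + conf_path]
    with open(conf_path) as handle:
        text = handle.read()
    return ["device file missing: " + p
            for p in device_files(text) if not os.path.exists(p)]
```

Calling check_gres() with the path to the node's gres.conf returns a list of messages. Running it from an init script before slurmd starts (for example after nvidia-smi has created the device nodes) might help catch the intermittent case where the files are not yet there at boot, though that is a guess about the cause.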
