Do the device nodes (/dev/nvidia0 and /dev/nvidia1) actually exist on the compute nodes? You may need to run nvidia-smi to create them.
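A quick way to check this on each compute node is a short script. The following is a minimal Python sketch, assuming the File value from the gres.conf quoted in this thread; the helpers expand_device_range and missing_devices are hypothetical, and the expansion handles only a single trailing bracket range, not Slurm's full hostlist syntax.

```python
import os
import re


def expand_device_range(pattern):
    """Expand one trailing bracket range such as '/dev/nvidia[0-1]'.

    Hypothetical helper: supports '[a-b]' and '[a,b-c]' suffixes only,
    not the full Slurm hostlist syntax.
    """
    match = re.fullmatch(r"(.*)\[([0-9,\-]+)\]", pattern)
    if match is None:
        return [pattern]  # no bracket range: a single device path
    prefix, body = match.groups()
    paths = []
    for part in body.split(","):
        if "-" in part:
            start, end = part.split("-", 1)
            paths.extend(f"{prefix}{i}" for i in range(int(start), int(end) + 1))
        else:
            paths.append(f"{prefix}{part}")
    return paths


def missing_devices(pattern):
    """Return the expanded device paths that do not exist on this host."""
    return [path for path in expand_device_range(pattern) if not os.path.exists(path)]


if __name__ == "__main__":
    pattern = "/dev/nvidia[0-1]"  # File= value from the gres.conf in this thread
    missing = missing_devices(pattern)
    if missing:
        print("missing device files:", ", ".join(missing))
        print("try running nvidia-smi to create them")
    else:
        print("all device files present:", ", ".join(expand_device_range(pattern)))
```

Run it on every GPU node; an empty "missing" list means the device files named in gres.conf are present on that host.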
On 5 October 2012 11:31, Alfonso Pardo <[email protected]> wrote:
> Yes, I have defined gres.conf as follows:
>
> ##gres.conf
> Name=gpu File=/dev/nvidia[0-1]
>
> I have two NVIDIA devices per node.
>
> On 05/10/12 11:55, [email protected] wrote:
>
> See the error. Read "man gres.conf". Is "File" defined?
>
> Alfonso Pardo <[email protected]> wrote:
>>
>> After activating DebugFlags=gres, I got the following error:
>>
>> [2012-10-05T08:22:44] error: gres_plugin_node_config_unpack: gres/gpu lacks File parameter for node bc-p10-01
>> [2012-10-05T08:22:44] gres/gpu: state for bc-p10-01
>> [2012-10-05T08:22:44] error: Setting node bc-p10-01 state to DOWN
>> [2012-10-05T08:22:44] debug2: inserting bc-p10-01(cluster) with 8 cpus
>> [2012-10-05T08:22:44] error: _slurm_rpc_node_registration node=bc-p10-01: Invalid argument
>>
>> On 05/10/12 08:20, Alfonso Pardo wrote:
>>
>> Thanks!
>>
>> I will set DebugFlags to "gres" and watch the logs.
>>
>> On 04/10/12 18:00, Moe Jette wrote:
>>
>> All I can think of is that the slurmd daemon was unable to read the
>> gres.conf file when it started. You could add "DebugFlags=gres" to
>> slurm.conf for more information about gres.
>>
>> Quoting Alfonso Pardo <[email protected]>:
>>
>> Hello,
>>
>> I have a cluster with GPU resources. The cluster works correctly,
>> but sometimes nodes go down with the following error: "gres/gpu
>> count too low"
>>
>> NodeName=bc-p10-01 Arch=x86_64 CoresPerSocket=4
>>    CPUAlloc=0 CPUErr=0 CPUTot=8 Features=(null)
>>    Gres=gpu:2
>>    NodeAddr=bc-p10-01 NodeHostName=bc-p10-01
>>    OS=Linux RealMemory=1 Sockets=2
>>    State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1
>>    BootTime=2012-07-30T12:25:31 SlurmdStartTime=2012-07-31T08:16:03
>>    Reason=gres/gpu count too low
>>
>> Any suggestions?
>>
>> --
>> Alfonso Pardo Díaz
>> Researcher / System Administrator at CETA-Ciemat
>> c/ Sola nº 1; 10200 Trujillo, ESPAÑA
>> Tel: +34 927 65 93 17 Fax: +34 927 32 32 37
>> http://www.ceta-ciemat.es/
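When several nodes drop out this way, it can help to pull the State, Reason and Gres fields out of the Key=Value node record quoted in the first message (the format printed by scontrol show node). The following is a minimal Python sketch under that assumption; parse_node_record and configured_gres_count are hypothetical helpers, and only the simple name:count Gres form seen in this thread is handled.

```python
def parse_node_record(text):
    """Parse a whitespace-separated Key=Value node record into a dict.

    Hypothetical helper. Values may contain spaces (for example
    'Reason=gres/gpu count too low'), so a token without '=' is appended
    to the value of the preceding key. A reason that itself contains '='
    would be split incorrectly.
    """
    record = {}
    last_key = None
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep and key:
            record[key] = value
            last_key = key
        elif last_key is not None:
            record[last_key] += " " + token
    return record


def configured_gres_count(record, name="gpu"):
    """Sum the counts for `name` in a Gres field such as 'gpu:2'.

    Assumes only 'name' or 'name:count' items, comma-separated.
    """
    gres = record.get("Gres", "(null)")
    if gres == "(null)":
        return 0
    total = 0
    for item in gres.split(","):
        parts = item.split(":")
        if parts[0] == name:
            total += int(parts[1]) if len(parts) > 1 else 1
    return total


# Node record copied from the first message in this thread.
SAMPLE = """NodeName=bc-p10-01 Arch=x86_64 CoresPerSocket=4
   CPUAlloc=0 CPUErr=0 CPUTot=8 Features=(null)
   Gres=gpu:2
   NodeAddr=bc-p10-01 NodeHostName=bc-p10-01
   OS=Linux RealMemory=1 Sockets=2
   State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1
   BootTime=2012-07-30T12:25:31 SlurmdStartTime=2012-07-31T08:16:03
   Reason=gres/gpu count too low"""

if __name__ == "__main__":
    node = parse_node_record(SAMPLE)
    # Report nodes that were set DOWN because of a gres problem.
    if node.get("State") == "DOWN" and node.get("Reason", "").startswith("gres/"):
        print(f"{node['NodeName']}: {node['Reason']} "
              f"(configured gpu count: {configured_gres_count(node)})")
```

Appending stray tokens to the previous key keeps multi-word reasons intact without needing a full grammar for the record format.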
