Do the device files (/dev/nvidia0 and /dev/nvidia1) actually exist on the
compute nodes? You may need to run nvidia-smi to create them.
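
If it helps, here is a rough way to check this on a node. It is only a
sketch: it assumes the File= value uses a simple "prefix[lo-hi]" range as in
your gres.conf, and the helper names (expand_file_spec, gres_files,
missing_devices) are made up for this example and are not part of Slurm.

```python
import os
import re


def expand_file_spec(spec):
    """Expand a simple 'prefix[lo-hi]' range; return [spec] if there is no range.

    Assumption: only a single numeric range is handled, not the full
    range syntax that Slurm itself accepts.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", spec)
    if match is None:
        return [spec]
    prefix, lo, hi, suffix = match.groups()
    return [f"{prefix}{i}{suffix}" for i in range(int(lo), int(hi) + 1)]


def gres_files(conf_text):
    """Collect the device paths named by every File= entry in gres.conf text."""
    paths = []
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0]  # drop comments
        for token in line.split():
            key, _, value = token.partition("=")
            if key.lower() == "file" and value:
                paths.extend(expand_file_spec(value))
    return paths


def missing_devices(conf_text, exists=os.path.exists):
    """Return the configured device paths that do not exist on this machine."""
    return [path for path in gres_files(conf_text) if not exists(path)]


if __name__ == "__main__":
    # The gres.conf line quoted in this thread.
    conf = "Name=gpu File=/dev/nvidia[0-1]\n"
    for path in missing_devices(conf):
        print(f"missing device node: {path}")
```

Run on a compute node, this prints one line per configured device file that
is absent. If it prints nothing, the device files are there and the problem
is elsewhere.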

On 5 October 2012 11:31, Alfonso Pardo <[email protected]> wrote:

> Yes, I have defined the gres.conf with:
>
> ##gres.conf
> Name=gpu File=/dev/nvidia[0-1]
>
>
> I have two Nvidia devices per node
>
>
> On 05/10/12 11:55, [email protected] wrote:
>
> See the error. Read "man gres.conf". Is "File" defined?
>
>
> Alfonso Pardo <[email protected]> wrote:
>>
>> After activating DebugFlags=gres, I got the following error:
>>
>> [2012-10-05T08:22:44] error: gres_plugin_node_config_unpack: gres/gpu
>> lacks File parameter for node bc-p10-01
>> [2012-10-05T08:22:44] gres/gpu: state for bc-p10-01
>> [2012-10-05T08:22:44] error: Setting node bc-p10-01 state to DOWN
>> [2012-10-05T08:22:44] debug2: inserting bc-p10-01(cluster) with 8 cpus
>> [2012-10-05T08:22:44] error: _slurm_rpc_node_registration node=bc-p10-01:
>> Invalid argument
>>
>>
>>
>>
>> On 05/10/12 08:20, Alfonso Pardo wrote:
>>
>> Thanks!
>>
>> I will set DebugFlags to "gres" and watch the logs.
>>
>>
>>
>> On 04/10/12 18:00, Moe Jette wrote:
>>
>> All that I can think of is the slurmd daemon was unable to read the
>> gres.conf file when starting. You could add to the slurm.conf
>> "DebugFlags=gres" for more information about gres.
>>
>> Quoting Alfonso Pardo <[email protected]>:
>>
>>
>> Hello,
>>
>> I have a cluster with GPU resources. The cluster works correctly,
>> but sometimes nodes go down with the following error: "gres/gpu
>> count too low"
>>
>>
>> NodeName=bc-p10-01 Arch=x86_64 CoresPerSocket=4
>>    CPUAlloc=0 CPUErr=0 CPUTot=8 Features=(null)
>>    Gres=gpu:2
>>    NodeAddr=bc-p10-01 NodeHostName=bc-p10-01
>>    OS=Linux RealMemory=1 Sockets=2
>>    State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1
>>    BootTime=2012-07-30T12:25:31 SlurmdStartTime=2012-07-31T08:16:03
>>    Reason=gres/gpu count too low
>>
>>
>> Any suggestions?
>>
>>
>>
>> --
>>
>> /Alfonso Pardo Díaz
>> *Researcher / System Administrator at CETA-Ciemat*
>> c/ Sola nº 1; 10200 Trujillo, ESPAÑA
>> Tel: +34 927 65 93 17 Fax: +34 927 32 32 37
>> [image: CETA-Ciemat logo] <http://www.ceta-ciemat.es/>/
>>
>>
>>
>>
>
>
>
