My best guess is that the gres.conf file did not exist when slurmd
started. That should produce an error like this in the SlurmdLogFile:

slurmd: error: can't stat gres.conf file <your_path_here>/gres.conf, assuming zero resource counts

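A resource count of zero, against the Gres=gpu:2 in your node
definition, would explain the "gres/gpu count too low" reason. If
instead a device file listed in gres.conf (the File parameter) does not
exist, slurmd will log a fatal error and exit.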
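
Both conditions can be checked on a node before slurmd is started. The following is a minimal Python sketch, not Slurm's own parser: the function names are hypothetical, the path in the usage line is an assumption about where gres.conf lives, and only a single numeric range of the form prefix[lo-hi] is expanded.

```python
import os
import re


def expand_device_range(spec):
    """Expand a simple range such as '/dev/nvidia[0-1]' into a list of paths.

    Assumption: only one numeric 'prefix[lo-hi]suffix' range is handled;
    Slurm itself accepts more forms than this sketch does.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", spec)
    if match is None:
        return [spec]
    prefix, low, high, suffix = match.groups()
    return [f"{prefix}{n}{suffix}" for n in range(int(low), int(high) + 1)]


def missing_gres_files(gres_conf_path):
    """Return the paths that do not exist.

    If gres.conf itself is missing, it is the only entry returned.
    Otherwise every File= value is expanded and each device path checked.
    """
    if not os.path.isfile(gres_conf_path):
        return [gres_conf_path]
    missing = []
    with open(gres_conf_path) as conf:
        for line in conf:
            line = line.split("#", 1)[0]  # ignore comments
            for token in line.split():
                key, _, value = token.partition("=")
                if key.lower() == "file" and value:
                    missing.extend(
                        path
                        for path in expand_device_range(value)
                        if not os.path.exists(path)
                    )
    return missing


if __name__ == "__main__":
    # Hypothetical location; use the directory that holds your slurm.conf.
    for path in missing_gres_files("/etc/slurm/gres.conf"):
        print("missing:", path)
```

An empty result means gres.conf and every device file it names were present at the time of the check. It says nothing about whether they were present when slurmd last started.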

Moe


Quoting Alfonso Pardo <[email protected]>:

> Yes, the devices are created on the nodes:
>
> #> /etc/init.d/nvidia-smi status
> Compute mode is already set to EXCLUSIVE_THREAD for GPU 0000:0C:00.0.
> Compute mode is already set to EXCLUSIVE_THREAD for GPU 0000:0A:00.0.
>
> The cluster works fine, but sometimes the nodes fail with a "down"
> state and I get the gres/gpu error.
>
> On 05/10/12 13:05, James Sharpe wrote:
>> Do the device nodes actually exist on the nodes? You may need to  
>> run nvidia-smi to create them.
>>
>> On 5 October 2012 11:31, Alfonso Pardo <[email protected]> wrote:
>>
>>    Yes, I have defined the gres.conf with:
>>
>>    ##gres.conf
>>    Name=gpu File=/dev/nvidia[0-1]
>>
>>
>>    I have two Nvidia devices per node
>>
>>
>>    On 05/10/12 11:55, [email protected] wrote:
>>>    See the error and read "man gres.conf". Is "File" defined?
>>>
>>>
>>>    Alfonso Pardo <[email protected]> wrote:
>>>
>>>        After enabling DebugFlags=gres I got the following error:
>>>
>>>        [2012-10-05T08:22:44] error: gres_plugin_node_config_unpack: gres/gpu lacks File parameter for node bc-p10-01
>>>        [2012-10-05T08:22:44] gres/gpu: state for bc-p10-01
>>>        [2012-10-05T08:22:44] error: Setting node bc-p10-01 state to DOWN
>>>        [2012-10-05T08:22:44] debug2: inserting bc-p10-01(cluster) with 8 cpus
>>>        [2012-10-05T08:22:44] error: _slurm_rpc_node_registration node=bc-p10-01: Invalid argument
>>>
>>>
>>>
>>>
>>>        On 05/10/12 08:20, Alfonso Pardo wrote:
>>>>        Thanks!
>>>>
>>>>        I will set DebugFlags to "gres" and watch the logs.
>>>>
>>>>
>>>>
>>>>        On 04/10/12 18:00, Moe Jette wrote:
>>>>>        All that I can think of is that the slurmd daemon was unable
>>>>>        to read the gres.conf file when starting. You could add
>>>>>        "DebugFlags=gres" to slurm.conf for more information about gres.
>>>>>
>>>>>        Quoting Alfonso Pardo <[email protected]>:
>>>>>
>>>>>>        Hello,
>>>>>>
>>>>>>        I have a cluster with GPU resources. The cluster works correctly,
>>>>>>        but sometimes nodes go down with the following reason: "gres/gpu
>>>>>>        count too low"
>>>>>>
>>>>>>
>>>>>>        NodeName=bc-p10-01 Arch=x86_64 CoresPerSocket=4
>>>>>>            CPUAlloc=0 CPUErr=0 CPUTot=8 Features=(null)
>>>>>>            Gres=gpu:2
>>>>>>            NodeAddr=bc-p10-01 NodeHostName=bc-p10-01
>>>>>>            OS=Linux RealMemory=1 Sockets=2
>>>>>>            State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1
>>>>>>            BootTime=2012-07-30T12:25:31 SlurmdStartTime=2012-07-31T08:16:03
>>>>>>            Reason=gres/gpu count too low
>>>>>>
>>>>>>
>>>>>>        Any suggestions?
>>>>>>
>>>>>>
>>>>>>
>>>>>>        --         /Alfonso Pardo Díaz
>>>>>>        *Researcher / System Administrator at CETA-Ciemat*
>>>>>>        c/ Sola nº 1; 10200 Trujillo, ESPAÑA
>>>>>>        Tel: +34 927 65 93 17 Fax: +34 927 32 32 37
>>>>>>        CETA-Ciemat logo<http://www.ceta-ciemat.es/>/
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>
>
