Oh, sorry.

Normally I copy slurm.conf to the nodes and then restart slurmd via the init
script; I don't call scontrol. The slurmd process is fully terminated and
restarted.

I don't remember if I followed that procedure when I hit the bug. Now that
I'm trying to reproduce it, I can't... I just modified slurm.conf on the
master to put back "Sockets=2 CoresPerSocket=4 ThreadsPerCore=2" for certain
nodes, created a new partition for them, restarted slurmctld and submitted a
job. Now it doesn't crash anymore... I know I have tried CR_Core, so maybe
the nodes had CR_CPU while the master had CR_Core. I just tried that too,
but it's the same: even if I don't copy+restart on the compute nodes, no crash.
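For reference, Slurm derives each node's logical CPU count from that line as
Sockets × CoresPerSocket × ThreadsPerCore; a quick sanity check of the
topology above (just the arithmetic, not any Slurm command):

```shell
# Topology taken from the NodeName line in slurm.conf:
#   Sockets=2 CoresPerSocket=4 ThreadsPerCore=2
sockets=2
cores_per_socket=4
threads_per_core=2

# Slurm derives the node's logical CPU count as the product of the three;
# this should match what "scontrol show node" reports as CPUs.
cpus=$((sockets * cores_per_socket * threads_per_core))
echo "CPUs=${cpus}"
```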

This is weird. I'll keep an eye on it.

Thanks for your replies.

Nicolas

On Tue, Aug 2, 2011 at 7:43 PM, <[email protected]> wrote:

> No, how did you change the values?
> Did you update slurm.conf for all nodes or just some nodes?
> Did you restart slurmctld or run scontrol reconfigure?
>
>
> Quoting Nicolas Bigaouette <[email protected]>:
>
>> As it was in the first email:
>>
>> NodeName=node[2-4] RealMemory=23000 Sockets=2 CoresPerSocket=4
>> ThreadsPerCore=2 State=UNKNOWN
>>
>>
>> On Tue, Aug 2, 2011 at 7:34 PM, <[email protected]> wrote:
>>
>>> I don't see socket/core/thread information in this slurm.conf. How
>>> exactly did you change them?
>>>
>>>
>>> Quoting Nicolas Bigaouette <[email protected]>:
>>>
>>>> Hi Danny,
>>>>
>>>> Yes of course... Here it is.
>>>>
>>>> N
>>>>
>>>> On Tue, Aug 2, 2011 at 6:52 PM, Danny Auble <[email protected]> wrote:
>>>>
>>>>> Hey Nicolas, could you send your complete slurm.conf? It would be
>>>>> interesting to see the other plugins you are using that may be
>>>>> contributing to the problem.
>>>>>
>>>>> Danny
>>>>>
>>>>> On Tuesday August 02 2011 6:43:17 PM you wrote:
>>>>>
>>>>> > Hi all,
>>>>> >
>>>>> > I'm having issues with slurm 2.2.7 and specifying the nodes' CPU
>>>>> > information.
>>>>> >
>>>>> > If I set the number of sockets, cores per socket and threads per core
>>>>> > like this:
>>>>> >
>>>>> > > NodeName=node[2-4] RealMemory=23000 Sockets=2 CoresPerSocket=4
>>>>> > > ThreadsPerCore=2 State=UNKNOWN
>>>>> >
>>>>> > and submit a job, slurmctld crashes. The last section of slurmctld.log is:
>>>>> >
>>>>> > > [2011-08-02T17:58:50] debug2: initial priority for job 49852 is 98
>>>>> > > [2011-08-02T17:58:50] debug2: found 3 usable nodes from config containing node[2-4]
>>>>> > > [2011-08-02T17:58:50] debug3: _pick_best_nodes: job 49852 idle_nodes 65 share_nodes 76
>>>>> > > [2011-08-02T17:58:50] debug2: sched: JobId=49852 allocated resources: NodeList=(null)
>>>>> > > [2011-08-02T17:58:50] _slurm_rpc_submit_batch_job JobId=49852 usec=1540
>>>>> > > [2011-08-02T17:58:50] debug: sched: Running job scheduler
>>>>> > > [2011-08-02T17:58:50] debug2: found 3 usable nodes from config containing node[2-4]
>>>>> > > [2011-08-02T17:58:50] debug3: _pick_best_nodes: job 49852 idle_nodes 65 share_nodes 76
>>>>> > > [2011-08-02T17:58:50] fatal: cons_res: sync loop not progressing
>>>>> >
>>>>> > I've also seen the error "cons_res: cpus computation error".
>>>>> >
>>>>> > There might be something wrong with my configuration, but slurm should
>>>>> > tell me so, not crash when a job is submitted...
>>>>> >
>>>>> > I'm playing with these options because a user reported that just using
>>>>> > Procs=16 would not spread his MPI processes across the allocated nodes.
>>>>> > I've fixed that by using --nodes=*-* and --ntasks-per-node=*, but the
>>>>> > crash is still relevant I guess...
>>>>> >
>>>>> > Could it be a bug?
>>>>> >
>>>>> > Thanks
>>>>> >
>>>>> > Nicolas
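As a side note on the --nodes/--ntasks-per-node fix mentioned in the quoted
message: the actual values were elided there, so the script below is purely
hypothetical (the job name, counts, and executable are mine, not from the
thread) — just a sketch of what such a submission might look like:

```shell
#!/bin/bash
# Hypothetical batch script; all names and counts are illustrative.
# Requesting an exact node range and a fixed task count per node forces
# the MPI ranks to be spread evenly, instead of letting a bare task
# count (e.g. -n 16) be packed onto fewer nodes.
#SBATCH --job-name=spread_test   # hypothetical name
#SBATCH --nodes=2-2              # hypothetical: exactly 2 nodes (min-max)
#SBATCH --ntasks-per-node=8      # hypothetical: 8 MPI ranks per node

srun ./my_mpi_program            # hypothetical executable
```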
