Thanks, the DebugFlags helped in principle, but restarting slurmctld on the server seems to have been the fix. We'd thought a reconfigure was sufficient, and I'm pretty sure we restarted the node daemons but not the server's.
For reference, I see there is a 32bit count limit so we'll stick with memdir:64 (and so on) rather than 64G.

(BTW our newer 'Remote' license issue did not get fixed at the same time)

Gareth

> -----Original Message-----
> From: Franco Broi [mailto:[email protected]]
> Sent: Wednesday, 7 January 2015 4:13 PM
> To: slurm-dev
> Subject: [slurm-dev] Re: gres without plugin
>
> Not sure why it's complaining about plugins, maybe the config files on
> the nodes are a bit messed up?
>
> Final suggestion before waiting for schedmd help would be to set
> DebugFlags=gres in slurm.conf and restart slurmctrld and do scontrol - reconf
>
> Just to prove it does work:
>
> [franco@charlie1 ~]$ salloc --gres=help
> Valid gres options are:
> Perseus[:count]
> Galaxy[:count]
> SRME[:count]
>
> [franco@charlie1 ~]$ salloc --gres=Galaxy:3 -p d1
> salloc: Pending job allocation 98683
> salloc: job 98683 queued and waiting for resources
>
> On Tue, 2015-01-06 at 20:00 -0800, [email protected] wrote:
> > > -----Original Message-----
> > > From: Franco Broi [mailto:[email protected]]
> > > Sent: Wednesday, 7 January 2015 11:47 AM
> > > To: slurm-dev
> > > Subject: [slurm-dev] Re: gres without plugin
> > >
> > > Anything in the slurmctrld log?
> >
> > For only a couple of nodes I'm getting messages every 5 minutes like:
> > [2015-01-07T14:52:03.211] error: gres_plugin_node_config_unpack: no
> > plugin configured to unpack data type memdir from node c007
> >
> > Gareth
> >
> > > On Tue, 2015-01-06 at 15:03 -0800, [email protected] wrote:
> > > > We're trying to follow http://slurm.schedmd.com/gres.html to schedule
> > > > requested use of /dev/shm using the term 'memdir' without success so far.
> > > >
> > > > In slurm.conf we have:
> > > > GresTypes=memdir
> > > > And
> > > > NodeName=DEFAULT Sockets=2 CoresPerSocket=10 ThreadsPerCore=1
> > > > RealMemory=131072 Gres=memdir:64
> > > > And it may matter that we have:
> > > > FastSchedule=2
> > > >
> > > > And each node has (The autogenerated bit is from Bright Cluster Manager):
> > > > > cat /etc/slurm/gres.conf
> > > > # This section of this file was automatically generated by cmd. Do not edit manually!
> > > > # BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
> > > > Name=gpu
> > > > Name=mic
> > > > # END AUTOGENERATED SECTION -- DO NOT REMOVE
> > > > Name=memdir Count=64
> > > >
> > > > (we will need to vary both these later to customize the resource
> > > > available on different nodes)
> > > >
> > > > (we'd like to try using 64G instead of 64 but just want it working
> > > > first)
> > > >
> > > > The resource seems to be set for a node:
> > > > > scontrol show node c001
> > > > NodeName=c001 Arch=x86_64 CoresPerSocket=10
> > > > CPUAlloc=0 CPUErr=0 CPUTot=20 CPULoad=0.08 Features=(null)
> > > > Gres=memdir:64
> > > > NodeAddr=c001 NodeHostName=c001 Version=14.03.0
> > > > OS=Linux RealMemory=131072 AllocMem=0 Sockets=2 Boards=1
> > > > State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
> > > > BootTime=2014-12-23T12:03:22 SlurmdStartTime=2014-12-23T01:06:05
> > > > CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> > > > ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> > > >
> > > > And seems to be available to use in principle:
> > > > > salloc --gres=help
> > > > Valid gres options are:
> > > > memdir[:count]
> > > >
> > > > But is not useable in practice:
> > > > > salloc --gres=memdir:16
> > > > salloc: error: Job submit/allocate failed: Invalid generic resource (gres) specification
> > > >
> > > > Can anyone see where we are going wrong?
> > > >
> > > > Gareth Williams
> > > >
> > > > ps. At some point we will also want to schedule gpus.
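
Footnote on the 32-bit count limit mentioned at the top of this message: a rough arithmetic sketch of why 64G would not fit while 64 does. It assumes the G suffix multiplies the count by 1024^3 and that the count is held in an unsigned 32-bit field; the variable names are only for illustration and do not come from Slurm.

```shell
# Assumption: a "G" suffix on a gres count means "multiply by 1024^3".
count_64g=$((64 * 1024 * 1024 * 1024))

# Assumption: the count is stored as an unsigned 32-bit value.
max_u32=4294967295

# The plain count we decided to keep using (memdir:64).
count_64=64

echo "64G expands to $count_64g; the 32-bit maximum is $max_u32"

if [ "$count_64g" -gt "$max_u32" ]; then
    echo "64G does not fit in a 32-bit count"
fi

if [ "$count_64" -le "$max_u32" ]; then
    echo "64 fits in a 32-bit count"
fi
```

Under those assumptions the largest whole-G value that fits would be 3G, so counting in gigabyte units (memdir:64 meaning 64 GB) looks like the simpler convention for us.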
