Thanks. The DebugFlags helped in principle, but restarting slurmctld on the
server seems to have been the fix.  We'd thought a reconfigure was sufficient,
and I'm pretty sure we had restarted the node daemons but not the server's.

For reference, I see there is a 32-bit count limit, so we'll stick with
memdir:64 (and so on) rather than 64G.
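
As a rough check of the arithmetic behind that choice, here is a minimal sketch. It assumes the count is held in an unsigned 32-bit field and that a G suffix multiplies the count by 1024^3; both are assumptions for illustration, not something confirmed in this thread. The helper names are hypothetical.

```python
# Hypothetical helpers, not part of Slurm.
# Assumption: the gres count limit is that of an unsigned 32-bit integer.
UINT32_MAX = 2**32 - 1

# Assumption: K/M/G suffixes are binary multipliers.
SUFFIX_MULTIPLIERS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3}


def expand_count(spec: str) -> int:
    """Expand a count such as '64' or '64G' into a plain integer."""
    spec = spec.strip()
    suffix = spec[-1].upper() if spec and spec[-1].isalpha() else ""
    digits = spec[:-1] if suffix else spec
    return int(digits) * SUFFIX_MULTIPLIERS[suffix]


def fits_in_uint32(spec: str) -> bool:
    """Return True if the expanded count fits in an unsigned 32-bit field."""
    return expand_count(spec) <= UINT32_MAX


if __name__ == "__main__":
    for spec in ("64", "64G"):
        print(spec, expand_count(spec), fits_in_uint32(spec))
```

Under those assumptions a plain count of 64 fits easily, while 64G expands to a value well beyond the 32-bit range, which is consistent with keeping memdir:64 and treating the unit as gigabytes by convention.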

(BTW our newer 'Remote' license issue did not get fixed at the same time)

Gareth

> -----Original Message-----
> From: Franco Broi [mailto:[email protected]]
> Sent: Wednesday, 7 January 2015 4:13 PM
> To: slurm-dev
> Subject: [slurm-dev] Re: gres without plugin
> 
> 
> 
> Not sure why it's complaining about plugins; maybe the config files on
> the nodes are a bit messed up?
> 
> Final suggestion before waiting for SchedMD help would be to set
> DebugFlags=gres in slurm.conf, restart slurmctld and do
> scontrol reconfigure.
> 
> Just to prove it does work:
> 
> [franco@charlie1 ~]$ salloc --gres=help
> Valid gres options are:
> Perseus[:count]
> Galaxy[:count]
> SRME[:count]
> 
> [franco@charlie1 ~]$ salloc --gres=Galaxy:3 -p d1
> salloc: Pending job allocation 98683
> salloc: job 98683 queued and waiting for resources
> 
> 
> 
> On Tue, 2015-01-06 at 20:00 -0800, [email protected] wrote:
> > > -----Original Message-----
> > > From: Franco Broi [mailto:[email protected]]
> > > Sent: Wednesday, 7 January 2015 11:47 AM
> > > To: slurm-dev
> > > Subject: [slurm-dev] Re: gres without plugin
> > >
> > >
> > >
> > > > Anything in the slurmctld log?
> >
> > For only a couple of nodes I'm getting messages every 5 minutes like:
> > [2015-01-07T14:52:03.211] error: gres_plugin_node_config_unpack: no plugin configured to unpack data type memdir from node c007
> >
> > Gareth
> >
> > >
> > >
> > > On Tue, 2015-01-06 at 15:03 -0800, [email protected] wrote:
> > > > > We're trying to follow http://slurm.schedmd.com/gres.html to schedule
> > > > > requested use of /dev/shm using the term 'memdir', without success so
> > > > > far.
> > > >
> > > > In slurm.conf we have:
> > > >  GresTypes=memdir
> > > > And
> > > > >  NodeName=DEFAULT Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 RealMemory=131072 Gres=memdir:64
> > > > > And it may matter that we have:
> > > >  FastSchedule=2
> > > >
> > > > > And each node has (the autogenerated bit is from Bright Cluster Manager):
> > > > > > cat /etc/slurm/gres.conf
> > > > > # This section of this file was automatically generated by cmd. Do not edit manually!
> > > > > # BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
> > > > > Name=gpu
> > > > > Name=mic
> > > > > # END AUTOGENERATED SECTION   -- DO NOT REMOVE
> > > > > Name=memdir Count=64
> > > >
> > > > (we will need to vary both these later to customize the resource
> > > > available on different nodes)
> > > >
> > > > > (we'd like to try using 64G instead of 64 but just want it working
> > > > > first)
> > > >
> > > > The resource seems to be set for a node:
> > > > > scontrol show node c001
> > > > NodeName=c001 Arch=x86_64 CoresPerSocket=10
> > > >    CPUAlloc=0 CPUErr=0 CPUTot=20 CPULoad=0.08 Features=(null)
> > > >    Gres=memdir:64
> > > >    NodeAddr=c001 NodeHostName=c001 Version=14.03.0
> > > >    OS=Linux RealMemory=131072 AllocMem=0 Sockets=2 Boards=1
> > > >    State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
> > > > >    BootTime=2014-12-23T12:03:22 SlurmdStartTime=2014-12-23T01:06:05
> > > >    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> > > >    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> > > >
> > > > And seems to be available to use in principle:
> > > > > salloc --gres=help
> > > > Valid gres options are:
> > > > memdir[:count]
> > > >
> > > > But is not useable in practice:
> > > > > salloc --gres=memdir:16
> > > > > salloc: error: Job submit/allocate failed: Invalid generic resource (gres) specification
> > > >
> > > > Can anyone see where we are going wrong?
> > > >
> > > > Gareth Williams
> > > >
> > > > ps. At some point we will also want to schedule gpus.
