No, slurmctld isn't running. Now. It was when I started, but I suspect I made at least one mod too many to slurm.conf. When I try to start slurmctld, I get these in slurmctld.log:

[2014-08-21T09:30:09.626] debug2: No ApbasilTimeout configured (65534)
[2014-08-21T09:30:09.630] debug2: No ApbasilTimeout configured (65534)
[2014-08-21T09:30:09.673] fatal: system has no usable batch compute nodes

I've just made a mod to slurm.conf that makes sure there's a default partition. I'd had named partitions in previously, but got some errors and warnings when trying to get the partition naming right in #SBATCH, so I'd gone back to the default config.

This appears to have started with a reboot several days ago. I'm now making sure it's not something deeper causing a Gemini network problem.

Thanks, Trey!
gerry

On Wed, Aug 20, 2014 at 10:11 PM, Trey Dockendorf <treyd...@tamu.edu> wrote:
> Is slurmctld running? My guess is that you need at least one partition
> defined in addition to the DEFAULT partition. Try creating a partition
> with any name, which will inherit everything from DEFAULT.
>
> - Trey
>
> =============================
>
> Trey Dockendorf
> Systems Analyst I
> Texas A&M University
> Academy for Advanced Telecommunications and Learning Technologies
> Phone: (979)458-2396
> Email: treyd...@tamu.edu
> Jabber: treyd...@tamu.edu
>
> ----- Original Message -----
> > From: "Gerry Creager - NOAA Affiliate" <gerry.crea...@noaa.gov>
> > To: "slurm-dev" <slurm-dev@schedmd.com>
> > Sent: Wednesday, August 20, 2014 4:40:40 PM
> > Subject: [slurm-dev] Re: Error: Unable to contact slurm controller
> >
> > Hi, Trey
> >
> > That's what I am intuiting, as well, but:
> >
> > gerry@loki:~/software/wrf/NME/DART_Lanai/models/wrf/work> egrep '^(PartitionName|NodeName)' /opt/slurm/default/etc/slurm.conf
> > NodeName=nid00[002-007,024-029,040-043,046-049,052-055,064-071,088-099,100-103,120-127,136-151,160-167,184-199,216-223,232-247,256-263,280-287] Sockets=4 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=65536
> > PartitionName=DEFAULT Shared=EXCLUSIVE State=UP DefaultTime=60 Nodes=nid00[002-007,024-029,040-043,046-049,052-055,064-071,088-099,100-103,120-127,136-151,160-167,184-199,216-223,232-247,256-263,280-287] MaxNodes=12
> >
> > looks pretty normal.
> >
> > gerry
> >
> > On Wed, Aug 20, 2014 at 4:25 PM, Trey Dockendorf <treyd...@tamu.edu> wrote:
> >
> > What's your slurm.conf look like? Do you have valid Nodes and
> > Partitions defined?
> >
> > For example:
> >
> > egrep '^(PartitionName|NodeName)' /etc/slurm/slurm.conf
> >
> > Sounds like invalid slurm.conf is preventing slurmctld from starting.
> >
> > - Trey
> >
> > ----- Original Message -----
> > > From: "Gerry Creager - NOAA Affiliate" <gerry.crea...@noaa.gov>
> > > To: "slurm-dev" <slurm-dev@schedmd.com>
> > > Sent: Wednesday, August 20, 2014 4:09:25 PM
> > > Subject: [slurm-dev] Re: Error: Unable to contact slurm controller
> > >
> > > Moe,
> > >
> > > Thanks. I've tried. I'm noting a pair of errors in the slurmctld.log file:
> > >
> > > [2014-08-20T15:58:58.458] debug: No DownNodes
> > > [2014-08-20T15:58:58.458] fatal: No PartitionName information available!
> > >
> > > So far, Google hasn't helped me much in this regard.
> > >
> > > gerry
> > >
> > > On Wed, Aug 20, 2014 at 11:39 AM, <je...@schedmd.com> wrote:
> > >
> > > Try this:
> > > http://slurm.schedmd.com/troubleshoot.html
> > >
> > > Quoting Gerry Creager - NOAA Affiliate <gerry.crea...@noaa.gov>:
> > >
> > > I'm trying to learn how to use and administer slurm on a new Cray system,
> > > and started seeing this yesterday:
> > > squeue
> > > slurm_load_jobs error: Unable to contact slurm controller (connect failure)
> > >
> > > I'm at a loss as to how to proceed.
> > >
> > > Thanks, Gerry
> > > --
> > > Gerry Creager
> > > NSSL/CIMMS
> > > 405.325.6371
> > > ++++++++++++++++++++++
> > > “Big whorls have little whorls,
> > > That feed on their velocity;
> > > And little whorls have lesser whorls,
> > > And so on to viscosity.”
> > > Lewis Fry Richardson (1881-1953)
> > >
> > > --
> > > Morris "Moe" Jette
> > > CTO, SchedMD LLC
> > >
> > > Slurm User Group Meeting
> > > September 23-24, Lugano, Switzerland
> > > Find out more http://slurm.schedmd.com/slurm_ug_agenda.html
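
For anyone landing on this thread with the same symptom, here is a rough sketch of the check Trey describes: list the PartitionName entries in slurm.conf and see whether anything other than DEFAULT is defined. It is written in Python rather than egrep only so the logic can be spelled out. The function names, the shortened node range, and the partition name "workq" are all made up for illustration, and the parser is deliberately simplistic (no handling of backslash line continuations or Include files), so treat it as a starting point, not as how slurmctld itself reads the file.

```python
import re


def partition_names(conf_text):
    """Return the PartitionName values found in slurm.conf-style text."""
    names = []
    for line in conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        match = re.match(r"PartitionName=(\S+)", line, re.IGNORECASE)
        if match:
            names.append(match.group(1))
    return names


def has_named_partition(conf_text):
    """True if at least one partition other than DEFAULT is defined."""
    return any(name.upper() != "DEFAULT" for name in partition_names(conf_text))


# Hypothetical, shortened excerpts; not the real config from this thread.
only_default = (
    "NodeName=nid00[002-007] Sockets=4 CoresPerSocket=4\n"
    "PartitionName=DEFAULT State=UP Nodes=nid00[002-007]\n"
)
with_named = only_default + "PartitionName=workq Default=YES\n"

print(has_named_partition(only_default))  # False
print(has_named_partition(with_named))    # True
```

To run it against a real file, read the text with open(path).read() and pass it to has_named_partition. If it comes back False, the config only carries a DEFAULT entry, which matches Trey's guess about why slurmctld refuses to start.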