No, slurmctld isn't running. Now. It was when I started, but I suspect I
made at least one mod too many to slurm.conf. When I try to start
slurmctld, I get these in slurmctld.log:
[2014-08-21T09:30:09.626] debug2: No ApbasilTimeout configured (65534)
[2014-08-21T09:30:09.630] debug2: No ApbasilTimeout configured (65534)
[2014-08-21T09:30:09.673] fatal: system has no usable batch compute nodes


I've just made a mod to slurm.conf that makes sure there's a default
partition. I'd had named partitions in previously, but got some errors and
warnings when trying to get the partition naming right in #SBATCH, so I'd
gone back to the default config.
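
A minimal sketch of the layout Trey describes, for anyone who hits the same
error. In slurm.conf, a PartitionName=DEFAULT line only sets default values
for the partition lines that follow it; it does not create a partition by
itself. The partition name "workq" below is a made-up placeholder, and the
node list is copied from the egrep output quoted further down. Treat it as an
illustration under those assumptions, not a tested configuration for this
system.

```
# Defaults inherited by every partition defined after this line.
PartitionName=DEFAULT Shared=EXCLUSIVE State=UP DefaultTime=60 Nodes=nid00[002-007,024-029,040-043,046-049,052-055,064-071,088-099,100-103,120-127,136-151,160-167,184-199,216-223,232-247,256-263,280-287] MaxNodes=12
# Hypothetical named partition ("workq" is an assumed name). It inherits the
# values above; Default=YES makes it the partition used when a job names none.
PartitionName=workq Default=YES
```

With a named partition marked Default=YES, a batch script that gives no
--partition option in its #SBATCH lines is submitted to that partition.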

This appears to have started with a reboot several days ago. I'm now making
sure it's not something deeper causing a Gemini network problem.

Thanks, Trey!
gerry


On Wed, Aug 20, 2014 at 10:11 PM, Trey Dockendorf <treyd...@tamu.edu> wrote:

>
> Is slurmctld running?  My guess is that you need at least one partition
> defined in addition to the DEFAULT partition.  Try creating a partition
> with any name, which will inherit everything from DEFAULT.
>
> - Trey
>
> =============================
>
> Trey Dockendorf
> Systems Analyst I
> Texas A&M University
> Academy for Advanced Telecommunications and Learning Technologies
> Phone: (979)458-2396
> Email: treyd...@tamu.edu
> Jabber: treyd...@tamu.edu
>
> ----- Original Message -----
> > From: "Gerry Creager - NOAA Affiliate" <gerry.crea...@noaa.gov>
> > To: "slurm-dev" <slurm-dev@schedmd.com>
> > Sent: Wednesday, August 20, 2014 4:40:40 PM
> > Subject: [slurm-dev] Re: Error: Unable to contact slurm controller
> >
> >
> > Hi, Trey
> >
> >
> > That's what I am intuiting, as well, but:
> >
> >
> >
> > gerry@loki:~/software/wrf/NME/DART_Lanai/models/wrf/work> egrep '^(PartitionName|NodeName)' /opt/slurm/default/etc/slurm.conf
> > NodeName=nid00[002-007,024-029,040-043,046-049,052-055,064-071,088-099,100-103,120-127,136-151,160-167,184-199,216-223,232-247,256-263,280-287] Sockets=4 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=65536
> > PartitionName=DEFAULT Shared=EXCLUSIVE State=UP DefaultTime=60 Nodes=nid00[002-007,024-029,040-043,046-049,052-055,064-071,088-099,100-103,120-127,136-151,160-167,184-199,216-223,232-247,256-263,280-287] MaxNodes=12
> >
> >
> > looks pretty normal.
> >
> >
> > gerry
> >
> >
> >
> >
> >
> > On Wed, Aug 20, 2014 at 4:25 PM, Trey Dockendorf <treyd...@tamu.edu> wrote:
> >
> >
> >
> > What's your slurm.conf look like? Do you have valid Nodes and
> > Partitions defined?
> >
> > For example:
> >
> > egrep '^(PartitionName|NodeName)' /etc/slurm/slurm.conf
> >
> > Sounds like invalid slurm.conf is preventing slurmctld from starting.
> >
> > - Trey
> >
> >
> >
> >
> > ----- Original Message -----
> > > From: "Gerry Creager - NOAA Affiliate" < gerry.crea...@noaa.gov >
> > > To: "slurm-dev" < slurm-dev@schedmd.com >
> > > Sent: Wednesday, August 20, 2014 4:09:25 PM
> > > Subject: [slurm-dev] Re: Error: Unable to contact slurm controller
> > >
> > >
> > > Moe,
> > >
> > >
> > > Thanks. I've tried. I'm noting a pair of errors in the slurmctld.log file:
> > >
> > >
> > >
> > > [2014-08-20T15:58:58.458] debug: No DownNodes
> > > [2014-08-20T15:58:58.458] fatal: No PartitionName information available!
> > >
> > >
> > > So far, Google hasn't helped me much in this regard.
> > >
> > >
> > > gerry
> > >
> > >
> > >
> > > On Wed, Aug 20, 2014 at 11:39 AM, < je...@schedmd.com > wrote:
> > >
> > >
> > >
> > > Try this:
> > > http://slurm.schedmd.com/troubleshoot.html
> > >
> > >
> > >
> > > Quoting Gerry Creager - NOAA Affiliate < gerry.crea...@noaa.gov >:
> > >
> > >
> > >
> > > I'm trying to learn how to use and administer slurm on a new Cray system,
> > > and started seeing this yesterday:
> > > squeue
> > > slurm_load_jobs error: Unable to contact slurm controller (connect failure)
> > >
> > > I'm at a loss as to how to proceed.
> > >
> > > Thanks, Gerry
> > > --
> > > Gerry Creager
> > > NSSL/CIMMS
> > > 405.325.6371
> > > ++++++++++++++++++++++
> > > “Big whorls have little whorls,
> > > That feed on their velocity;
> > > And little whorls have lesser whorls,
> > > And so on to viscosity.”
> > > Lewis Fry Richardson (1881-1953)
> > >
> > >
> > > --
> > > Morris "Moe" Jette
> > > CTO, SchedMD LLC
> > >
> > > Slurm User Group Meeting
> > > September 23-24, Lugano, Switzerland
> > > Find out more http://slurm.schedmd.com/slurm_ug_agenda.html
> > >
> > >
> > >
> > >
> >
> >
> >
>



-- 
Gerry Creager
NSSL/CIMMS
405.325.6371
++++++++++++++++++++++
“Big whorls have little whorls,
That feed on their velocity;
And little whorls have lesser whorls,
And so on to viscosity.”
Lewis Fry Richardson (1881-1953)
