Actually, if sinfo is responding, then slurmctld is running! I would
check the slurmctld log on your head node and the slurmd log on the
compute node(s) to look for hints regarding communication problems.
For example, can you ping back and forth between the 2 nodes? Do you
have a firewall running on either or both nodes? etc.
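One way to test that communication path is to check whether the daemon ports accept TCP connections in each direction. The following is a minimal Python sketch, not a definitive diagnostic. It assumes Slurm's default ports (6817 for slurmctld, 6818 for slurmd), which may differ if SlurmctldPort or SlurmdPort is set in your slurm.conf. The function name `port_reachable` is made up for illustration.

```python
import socket


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and attempts a TCP connect
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures
        return False
```

Run from the compute node, `port_reachable("head", 6817)` would probe slurmctld; run from the head node, `port_reachable("compute-0", 6818)` would probe slurmd. Both host names are placeholders. A False result in either direction is consistent with a firewall or name-resolution problem, though it does not prove one.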
Andy
On 04/11/2016 11:16 PM, John Hearns wrote:
Thank you, Lachlan.
Actually I was being old school and running
/etc/init.d/slurm start
(yes - I know about systemd!)
I will try systemd.
For what it's worth, /var/log/slurm.log on the compute node reads:
[2016-04-12T03:52:18.813] Message aggregation disabled
[2016-04-12T03:52:18.814] error: _cpu_freq_cpu_avail: Could not open /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
[2016-04-12T03:52:18.814] Resource spec: Reserved system memory limit not configured for this node
[2016-04-12T03:52:18.816] slurmd version 15.08.6 started
[2016-04-12T03:52:18.816] slurmd started on Tue, 12 Apr 2016 03:52:18 +0100
[2016-04-12T03:52:18.816] CPUs=16 Boards=1 Sockets=2 Cores=8 Threads=1 Memory=64318 TmpDisk=32159 Uptime=494183 CPUSpecList=(null)
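When a log like this grows long, it can help to pull out only the entries marked `error:`. The following is a rough Python sketch, assuming each entry sits on a single line in the `[timestamp] error: message` shape seen above. The name `error_entries` is hypothetical and not part of Slurm.

```python
import re

# Matches lines shaped like "[2016-04-12T03:52:18.814] error: some message"
ERROR_LINE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+error:\s*(?P<msg>.*)$")


def error_entries(log_text):
    """Return (timestamp, message) pairs for every 'error:' line in log_text."""
    entries = []
    for line in log_text.splitlines():
        match = ERROR_LINE.match(line.strip())
        if match:
            entries.append((match.group("ts"), match.group("msg")))
    return entries
```

Passing the contents of /var/log/slurm.log to `error_entries` would return a list of timestamped error messages. The same function could be pointed at the slurmctld log on the head node.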
*From:* Lachlan Musicman [data...@gmail.com]
*Sent:* 12 April 2016 04:10
*To:* slurm-dev
*Subject:* [slurm-dev] Re: Slurm service timeout - hints on diagnostics please?
I think I saw something like this just now -
are you running:
systemctl start slurm
or
systemctl start slurmd ?
And slurmctld is running on the head?
Cheers
L.
------
The most dangerous phrase in the language is,
"We've always done it this way."
- Grace Hopper
On 12 April 2016 at 13:04, John Hearns <john.hea...@xma.co.uk> wrote:
I am working on an OpenHPC/Warewulf cluster.
When I start the slurmd service on the compute nodes, systemctl sits there for a long time, then reports:
Starting slurm (via systemctl): Job for slurm.service failed because a timeout was exceeded. See "systemctl status slurm.service" and "journalctl -xe" for details.
    [FAILED]
However, a slurmd process is started on the compute node.
On the head node, sinfo says the compute node is down and Not Responding.
I don't expect my problem to be solved by the list, but would appreciate some diagnostics.
The debug level is set to 3 in slurm.conf.
Scanned by MailMarshal - M86 Security's comprehensive email content security solution.
Any views or opinions presented in
this email are solely those of the author and do not
necessarily represent those of the company.
Employees of XMA Ltd are expressly required not to
make defamatory statements and not to infringe or
authorise any infringement of copyright or any other
legal right by email communications. Any such
communication is contrary to company policy and
outside the scope of the employment of the
individual concerned. The company will not accept
any liability in respect of such communication, and
the employee responsible will be personally liable
for any damages or other liability arising. XMA
Limited is registered in England and Wales
(registered no. 2051703). Registered Office: Wilford
Industrial Estate, Ruddington Lane, Wilford,
Nottingham, NG11 7EP