[slurm-dev] Re: Slurm service timeout - hints on diagnostics please?

Andy Riebs Tue, 12 Apr 2016 06:58:29 -0700
   Actually, if sinfo is responding, then slurmctld is running! I would
 check the slurmctld log on your head node and the slurmd log on the
 compute node(s) to look for hints regarding communication problems.
 For example, can you ping back and forth between the 2 nodes? Do you
 have a firewall running on either or both nodes? etc.
 
 Andy
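
 The ping and firewall questions above can be checked at the TCP level as
 well. What follows is only a minimal sketch, assuming the default Slurm
 ports (SlurmctldPort 6817 on the head node, SlurmdPort 6818 on the compute
 nodes; either may be overridden in slurm.conf). The helper name port_open
 is hypothetical and not part of Slurm.

```python
import socket

# Default Slurm ports. These are assumptions: check SlurmctldPort and
# SlurmdPort in your own slurm.conf before relying on them.
SLURMCTLD_PORT = 6817  # slurmctld listens here on the head node
SLURMD_PORT = 6818     # slurmd listens here on each compute node


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

 Run from the compute node, port_open("head", SLURMCTLD_PORT) tests the path
 to slurmctld; run from the head node, port_open("node001", SLURMD_PORT)
 tests the path back to slurmd. Both host names are placeholders. A False
 result in only one direction would be consistent with a firewall on one of
 the nodes, though it does not prove it.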
 
 On 04/11/2016 11:16 PM, John Hearns
   wrote:
     Thank you, Lachlan.
     Actually I was being old school and running
       /etc/init.d/slurm start
     (yes - I know about systemd!)
     I will try systemd.
     For what it's worth, /var/log/slurm.log on the compute node reads:
       [2016-04-12T03:52:18.813] Message aggregation disabled
       [2016-04-12T03:52:18.814] error: _cpu_freq_cpu_avail: Could not open /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
       [2016-04-12T03:52:18.814] Resource spec: Reserved system memory limit not configured for this node
       [2016-04-12T03:52:18.816] slurmd version 15.08.6 started
       [2016-04-12T03:52:18.816] slurmd started on Tue, 12 Apr 2016 03:52:18 +0100
       [2016-04-12T03:52:18.816] CPUs=16 Boards=1 Sockets=2 Cores=8 Threads=1 Memory=64318 TmpDisk=32159 Uptime=494183 CPUSpecList=(null)
       *From:*
           Lachlan Musicman [data...@gmail.com]
           *Sent:* 12 April 2016 04:10
           *To:* slurm-dev
           *Subject:* [slurm-dev] Re: Slurm service timeout -
           hints on diagnostics please?
                   I think I saw something like this just now. Are you
                   running:
                     systemctl start slurm
                   or
                     systemctl start slurmd?
                   And is slurmctld running on the head?
             Cheers
           
           L.
                 ------
                   The most dangerous phrase in the language is,
                   "We've always done it this way."
                   
                   - Grace Hopper
           On 12 April 2016 at 13:04, John
             Hearns <john.hea...@xma.co.uk>
             wrote:
                 I am working on an
                   OpenHPC/Warewulf cluster.
                   When I start the slurmd service on the
                     compute nodes the systemctl sits there for a
                     long time,
                   then reports:
                     Starting slurm (via systemctl): Job for
                       slurm.service failed because a timeout was
                       exceeded. See "systemctl status slurm.service"
                       and "journalctl -xe" for details.
                       [FAILED]
                   However a slurmd process is started on the
                     compute node.
                   On the head node
                       sinfo says the compute node is down and Not
                       Responding
                   I don't expect my problem to be solved by the
                     list, but would appreciate some diagnostics.
                   The debug level is set to 3 in slurm.conf.
                       Scanned by MailMarshal
                     
                       - M86 Security's comprehensive email content
                       security solution.
                   Any views or opinions presented in
                 this email are solely those of the author and do not
                 necessarily represent those of the company.
                 Employees of XMA Ltd are expressly required not to
                 make defamatory statements and not to infringe or
                 authorise any infringement of copyright or any other
                 legal right by email communications. Any such
                 communication is contrary to company policy and
                 outside the scope of the employment of the
                 individual concerned. The company will not accept
                 any liability in respect of such communication, and
                 the employee responsible will be personally liable
                 for any damages or other liability arising. XMA
                 Limited is registered in England and Wales
                 (registered no. 2051703). Registered Office: Wilford
                 Industrial Estate, Ruddington Lane, Wilford,
                 Nottingham, NG11 7EP 