Similar to Lachlan's suggestions: check that slurm.conf is identical on
all nodes, and in particular that the CPU and core counts are correct.
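
For example, a rough sketch (assuming passwordless ssh and the Debian
slurm-llnl config path; adjust both to your setup):

  # compare slurm.conf across the controller and the nodes
  for h in bkr hpcc-1 r1-0{1..7}; do
      ssh $h md5sum /etc/slurm-llnl/slurm.conf
  done

  # on the node itself, print what slurmd actually detects and compare
  # it against the Sockets/CoresPerSocket/ThreadsPerCore in slurm.conf
  slurmd -C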

Have you tried removing the Gres parameters? Perhaps it's looking for devices
it can't find.
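
If you want to test that quickly, a sketch (untested; mirror the change on
all nodes, and check whether the node has a gres.conf defining the xld/xcd
devices):

  # in slurm.conf, drop the Gres= from the node definition:
  NodeName=hpcc-1 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN RealMemory=48000 TmpDisk=500000
  # then push it out (or restart the daemons if the node definition
  # change isn't picked up by a reconfigure):
  scontrol reconfigure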

Paddy

On Tue, Jan 31, 2017 at 02:08:51PM -0800, Lachlan Musicman wrote:

> trivial questions: does the node have the correct time w.r.t. the head node?
> and is the node correctly configured in slurm.conf? (# of CPUs, amount of
> memory, etc.)
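> 
> (a quick sanity check for both, assuming ssh access from the head node;
> slurmd -C prints the hardware slurmd actually detects, so you can compare
> it against the NodeName line in slurm.conf)
> 
>   ssh hpcc-1 'date; slurmd -C'
>   date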
> 
> cheers
> L.
> 
> ------
> The most dangerous phrase in the language is, "We've always done it this
> way."
> 
> - Grace Hopper
> 
> On 1 February 2017 at 08:03, E V <eliven...@gmail.com> wrote:
> 
> >
> > Enabling debug5 doesn't show anything more useful. I don't see anything
> > relevant in slurmd.log, just job starts and stops. slurmctld.log has the
> > takeover output, with the backup head node immediately draining itself
> > same as before, but with more of the context before the DRAIN:
> >
> > [2017-01-31T15:37:38.387] debug:  Spawning registration agent for
> > bkr,hpcc-1,r1-[01-07] 9 hosts
> > [2017-01-31T15:37:38.387] debug2: Spawning RPC agent for msg_type
> > REQUEST_NODE_REGISTRATION_STATUS
> > [2017-01-31T15:37:38.387] debug2: got 1 threads to send out
> > [2017-01-31T15:37:38.388] debug3: Tree sending to bkr
> > [2017-01-31T15:37:38.388] debug2: slurm_connect failed: Connection refused
> > [2017-01-31T15:37:38.388] debug2: Error connecting slurm stream socket
> > at 172.18.1.102:6820: Connection refused
> > [2017-01-31T15:37:38.388] debug3: connect refused, retrying
> > [2017-01-31T15:37:38.388] debug2: Tree head got back 0 looking for 9
> > [2017-01-31T15:37:38.388] debug3: Tree sending to hpcc-1
> > [2017-01-31T15:37:38.389] debug3: Tree sending to r1-01
> > [2017-01-31T15:37:38.389] debug3: Tree sending to r1-02
> > [2017-01-31T15:37:38.389] debug3: Tree sending to r1-03
> > [2017-01-31T15:37:38.389] debug3: Tree sending to r1-04
> > [2017-01-31T15:37:38.389] debug3: Tree sending to r1-05
> > [2017-01-31T15:37:38.390] debug3: Tree sending to r1-07
> > [2017-01-31T15:37:38.390] debug3: Tree sending to r1-06
> > [2017-01-31T15:37:38.390] debug4: orig_timeout was 10000 we have 0
> > steps and a timeout of 10000
> > [2017-01-31T15:37:38.390] debug4: orig_timeout was 10000 we have 0
> > steps and a timeout of 10000
> > [2017-01-31T15:37:38.391] debug4: orig_timeout was 10000 we have 0
> > steps and a timeout of 10000
> > [2017-01-31T15:37:38.391] debug4: orig_timeout was 10000 we have 0
> > steps and a timeout of 10000
> > [2017-01-31T15:37:38.391] debug4: orig_timeout was 10000 we have 0
> > steps and a timeout of 10000
> > [2017-01-31T15:37:38.391] debug4: orig_timeout was 10000 we have 0
> > steps and a timeout of 10000
> > [2017-01-31T15:37:38.391] debug4: orig_timeout was 10000 we have 0
> > steps and a timeout of 10000
> > [2017-01-31T15:37:38.391] debug4: orig_timeout was 10000 we have 0
> > steps and a timeout of 10000
> > [2017-01-31T15:37:38.392] debug2: Tree head got back 1
> > [2017-01-31T15:37:38.392] debug2: Tree head got back 2
> > [2017-01-31T15:37:38.392] debug2: Tree head got back 3
> > [2017-01-31T15:37:38.392] debug2: Tree head got back 4
> > [2017-01-31T15:37:38.393] debug2: Processing RPC:
> > MESSAGE_NODE_REGISTRATION_STATUS from uid=0
> > [2017-01-31T15:37:38.393] error: Setting node hpcc-1 state to DRAIN
> > [2017-01-31T15:37:38.393] drain_nodes: node hpcc-1 state set to DRAIN
> > [2017-01-31T15:37:38.393] error: _slurm_rpc_node_registration
> > node=hpcc-1: Invalid argument
> > [2017-01-31T15:37:38.403] debug2: Processing RPC:
> > MESSAGE_NODE_REGISTRATION_STATUS from uid=0
> > [2017-01-31T15:37:38.403] debug3: Registered job 1932073.0 on node r1-05
> > [2017-01-31T15:37:38.403] debug3: resetting job_count on node r1-05 from 1
> > to 2
> > [2017-01-31T15:37:38.403] debug2: _slurm_rpc_node_registration
> > complete for r1-05 usec=76
> > [2017-01-31T15:37:38.404] debug2: Tree head got back 5
> > [2017-01-31T15:37:38.405] debug2: Tree head got back 6
> >
> > On Tue, Jan 31, 2017 at 9:54 AM, E V <eliven...@gmail.com> wrote:
> > >
> > > No epilog scripts defined, and access to the state save location is fine,
> > > as an scontrol takeover works, but it does have the side effect of the
> > > backup draining itself. I set SlurmctldDebug to debug3 and didn't get
> > > much more info:
> > > [2017-01-31T09:45:22.329] debug2: node_did_resp hpcc-1
> > > [2017-01-31T09:45:22.329] debug2: node_did_resp r1-07
> > > [2017-01-31T09:45:22.329] debug2: node_did_resp r1-03
> > > [2017-01-31T09:45:22.329] debug2: node_did_resp r1-05
> > > [2017-01-31T09:45:22.329] debug2: node_did_resp r1-02
> > > [2017-01-31T09:45:22.329] debug2: node_did_resp r1-04
> > > [2017-01-31T09:45:22.329] debug2: node_did_resp r1-01
> > > [2017-01-31T09:45:22.341] debug2: Processing RPC:
> > > MESSAGE_NODE_REGISTRATION_STATUS from uid=0
> > > [2017-01-31T09:45:22.341] error: Setting node hpcc-1 state to DRAIN
> > > [2017-01-31T09:45:22.341] drain_nodes: node hpcc-1 state set to DRAIN
> > > [2017-01-31T09:45:22.341] error: _slurm_rpc_node_registration
> > > node=hpcc-1: Invalid argument
> > >
> > > I'll try turning it up to debug5 and also enabling SlurmdDebug to see
> > > if that shows anything.
> > >
> > > On Mon, Jan 30, 2017 at 12:42 PM, Paddy Doyle <pa...@tchpc.tcd.ie> wrote:
> > >>
> > >> Hi E V,
> > >>
> > >> You could turn up the SlurmctldDebug and SlurmdDebug values in
> > >> slurm.conf to get it to be more verbose.
> > >>
> > >> Do you have any epilog scripts defined?
> > >>
> > >> If it's related to the node being the backup controller, as a wild guess,
> > >> perhaps your backup controller doesn't have access to the
> > >> StateSaveLocation directory?
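> > >>
> > >> e.g. something like this in slurm.conf on all the nodes, followed by an
> > >> "scontrol reconfigure" (debug3 and up get chatty, so watch the log sizes):
> > >>
> > >>   SlurmctldDebug=debug3
> > >>   SlurmdDebug=debug3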
> > >>
> > >> Paddy
> > >>
> > >> On Mon, Jan 30, 2017 at 07:38:39AM -0800, E V wrote:
> > >>
> > >>>
> > >>> Running Slurm 15.08.12 on a Debian 8 system, we have a node that keeps
> > >>> being drained and I can't tell why. From slurmctld.log on our primary
> > >>> slurmctld:
> > >>>
> > >>> [2017-01-28T06:45:29.961] error: Setting node hpcc-1 state to DRAIN
> > >>> [2017-01-28T06:45:29.961] drain_nodes: node hpcc-1 state set to DRAIN
> > >>> [2017-01-28T06:45:29.961] error: _slurm_rpc_node_registration
> > >>> node=hpcc-1: Invalid argument
> > >>>
> > >>> The slurmd.log on the node itself shows a normal job completion message
> > >>> just before, and then nothing immediately after the drain:
> > >>>
> > >>> [2017-01-28T06:42:46.563] _run_prolog: run job script took usec=7
> > >>> [2017-01-28T06:42:46.563] _run_prolog: prolog with lock for job
> > >>> 1930101 ran for 0 seconds
> > >>> [2017-01-28T06:42:48.427] [1930101.0] done with job
> > >>> [2017-01-28T14:37:26.365] [1928122] sending
> > >>> REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
> > >>> [2017-01-28T14:37:26.367] [1928122] done with job
> > >>>
> > >>> Any thoughts on figuring out/fixing this? The node being drained also
> > >>> happens to be our backup controller, if that may be related:
> > >>> $ grep hpcc-1 slurm.conf
> > >>> BackupController=hpcc-1
> > >>> NodeName=hpcc-1 Gres=xld:1,xcd:1 Sockets=2 CoresPerSocket=6
> > >>> ThreadsPerCore=2 State=UNKNOWN RealMemory=48000 TmpDisk=500000
> > >>> PartitionName=headNode Nodes=hpcc-1 Default=NO MaxTime=INFINITE State=UP
> > >>>
> > >>> This is on our testing/development grid systems, so we can easily make
> > >>> changes to debug/fix the problem.
> > >>>
> > >>
> > >> --
> > >> Paddy Doyle
> > >> Trinity Centre for High Performance Computing,
> > >> Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
> > >> Phone: +353-1-896-3725
> > >> http://www.tchpc.tcd.ie/
> >

-- 
Paddy Doyle
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
Phone: +353-1-896-3725
http://www.tchpc.tcd.ie/
