Re: [slurm-users] All user's jobs killed at the same time on all nodes

John Hearns Fri, 29 Jun 2018 04:26:37 -0700

I have got this all wrong. Paddy Doyle has got it right.

However, are you SURE that mpirun is not creating tasks on the other
machines?
I would look at the compute nodes while the job is running and do
ps -eaf --forest


Also using mpirun to run a single core gives me the heebie-jeebies...

https://en.wikipedia.org/wiki/Heebie-jeebies_(idiom)




On 29 June 2018 at 13:16, Matteo Guglielmi <matteo.guglie...@dalco.ch>
wrote:

> You are right, but I'm actually supporting the system administrator of that
> cluster; I'll mention this to him.
>
> Besides that,
>
> the user runs this for loop to submit the jobs:
>
>
> # submit.sh #
>
> typeset -i i=1
> typeset -i j=12500  #number of frames goes to each core = number of frames (1000000)/40 (cores) =
> typeset -i k=1
>
> while [ $i -le 36 ]  #the number of frames
> do
>
> sbatch run-5o$i.sh $i $j $k
>
> i=$i+1 # number of frames goes to each node (5*200 = 1000)
> done
>
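
For reference, the loop above can be dry-run to see exactly which sbatch
commands it issues before anything reaches the queue. This is only a sketch:
the function name print_submit_cmds is made up here, and it prints the
commands rather than running them.

```shell
#!/bin/bash
# Dry run of the submit loop: print each sbatch command instead of executing it.
# print_submit_cmds is a hypothetical helper, not part of the user's script.
print_submit_cmds() {
    local n=$1 j=$2 k=$3 i
    for ((i = 1; i <= n; i++)); do
        echo "sbatch run-5o${i}.sh ${i} ${j} ${k}"
    done
}

# Same values as submit.sh: 36 jobs, j=12500, k=1
print_submit_cmds 36 12500 1
```

If the printed list looks right, the same loop with the echo removed submits
the jobs for real.
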
> where each run-5oXX.sh jobfile looks like this:
>
>
> #!/bin/bash
>
> #SBATCH --job-name=charmm-test
> #SBATCH --nodes=1
> #SBATCH --ntasks=1
> #SBATCH --cpus-per-task=1
>
> export PATH=/usr/lib64/openmpi/bin/:$PATH
> export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH
>
> mpirun -np 1 /opt/cluster/programs/charmm/c42b2/exec/gnu_M/charmm < newphcnl99a0.inp > newphcnl99a0.out
>
>
>
>
> so they are all independent mpiruns...  if one of them is killed, why
> would all others go down as well?
>
>
> That would make sense if a single mpirun is running 36 tasks... but the
> user is not doing this.
>
> ________________________________
> From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of
> John Hearns <hear...@googlemail.com>
> Sent: Friday, June 29, 2018 12:52:41 PM
> To: Slurm User Community List
> Subject: Re: [slurm-users] All user's jobs killed at the same time on all
> nodes
>
> Matteo, a stupid question but if these are single CPU jobs why is mpirun
> being used?
>
> Is your user using these 36 jobs to construct a parallel job to run charmm?
> If the mpirun is killed, yes all the other processes which are started by
> it on the other compute nodes will be killed.
>
> I suspect your user is trying to do something "smart". You should give
> that person an example of how to reserve 36 cores and submit a charmm job.
>
>
> On 29 June 2018 at 12:13, Matteo Guglielmi <matteo.guglie...@dalco.ch> wrote:
> Dear community,
>
> I have a user who usually submits 36 (identical) jobs at a time using a
> simple for loop, so the jobs are all sbatched at the same time.
>
> Each job requests a single core and all jobs are independent of one
> another (they read different input files and write to different output
> files).
>
> The jobs then usually start over the next couple of hours, at somewhat
> random times.
>
> What happens then is that after a certain amount of time (maybe 2 to
> 12 hours) ALL jobs belonging to this particular user are killed by slurm
> on all nodes at exactly the same time.
>
> One example:
>
> ### master: /var/log/slurmctld.log ###
>
> [2018-06-28T18:43:06.871] _slurm_rpc_submit_batch_job: JobId=718560
> InitPrio=4294185624 usec=255
> ...
> [2018-06-28T19:29:04.671] backfill: Started JobID=718560 in partition on
> node38
> ...
> [2018-06-28T23:37:53.471] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 718560
> uid 1007
> [2018-06-28T23:37:53.472] _job_signal: 9 of running JobID=718560
> State=0x8004 NodeCnt=1 successful 0x8004
>
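
One thing worth noting in the slurmctld excerpt above: the REQUEST_KILL_JOB
line records uid 1007, the same UID the batch job was launched for according
to the slurmd log. As far as I know that field is the UID of whoever sent the
kill request, so it may be worth checking whether something run by the user
is cancelling the jobs. A rough sketch for pulling those lines out of the
controller log, assuming the one-entry-per-line format of the real log file
(not the wrapped version in this mail); the function name kill_requests is
made up:

```shell
#!/bin/bash
# Print "timestamp jobid uid" for every kill request in a slurmctld log.
# Assumes entries of the form seen in the quoted log, on a single line:
#   [time] _slurm_rpc_kill_job: REQUEST_KILL_JOB job <id> uid <uid>
# kill_requests is a hypothetical helper name. Reads the files given as
# arguments, or standard input when none are given.
kill_requests() {
    awk '/REQUEST_KILL_JOB/ {
        job = ""; uid = ""
        for (i = 1; i < NF; i++) {
            if ($i == "job") job = $(i + 1)
            if ($i == "uid") uid = $(i + 1)
        }
        ts = $1
        gsub(/^\[|\]$/, "", ts)   # strip the surrounding brackets
        print ts, job, uid
    }' "$@"
}

# Example: kill_requests /var/log/slurmctld.log
```

If all 36 jobs show up with the same timestamp and the same uid, that points
at a single cancel request from that account rather than at 36 separate
failures.
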
> ### node38: /var/log/slurmd.log ###
>
> [2018-06-28T19:29:05.410] _run_prolog: prolog with lock for job 718560 ran
> for 0 seconds
> [2018-06-28T19:29:05.410] Launching batch job 718560 for UID 1007
> [2018-06-28T19:29:05.427] [718560.batch] Munge cryptographic signature
> plugin loaded
> [2018-06-28T19:29:05.431] [718560.batch] debug level = 2
> [2018-06-28T19:29:05.431] [718560.batch] starting 1 tasks
> [2018-06-28T19:29:05.431] [718560.batch] task 0 (69791) started
> 2018-06-28T19:29:05
> [2018-06-28T19:29:05.440] [718560.batch] Can't propagate RLIMIT_NOFILE of
> 65536 from submit host: Operation not permitted
> ...
> [2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69794
> (charmm)
> [2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69792
> (mpirun)
> [2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69791
> (slurm_script)
> [2018-06-28T23:37:53.480] [718560.batch] Sent signal 18 to 718560.429496729
> [2018-06-28T23:37:53.485] [718560.batch] error: *** JOB 718560 ON node38
> CANCELLED AT 2018-06-28T23:37:53 ***
> [2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69794
> (charmm)
> [2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69792
> (mpirun)
> [2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69791
> (slurm_script)
> [2018-06-28T23:37:53.488] [718560.batch] Sent signal 15 to
> 718560.4294967294
> [2018-06-28T23:37:53.492] [718560.batch] task 0 (69791) exited. Killed by
> signal 15.
> [2018-06-28T23:37:53.512] [718560.batch] job 718560 completed with
> slurm_rc = 0, job_rc = 15
> [2018-06-28T23:37:53.512] [718560.batch] sending
> REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
> [2018-06-28T23:37:53.516] [718560.batch] done with job
>
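
For anyone reading the slurmd log above by signal number: on Linux, 18 is
SIGCONT, 15 is SIGTERM and 9 is SIGKILL. CONT followed by TERM is what I
would expect to see from an ordinary job cancel. The numbers can differ on
other platforms, so here is a quick way to check them on your own system
using the shell's kill builtin (signal_name is just a made-up wrapper):

```shell
#!/bin/bash
# Translate a signal number into its name using the shell's kill builtin.
# signal_name is a hypothetical helper, used here only for illustration.
signal_name() {
    printf 'SIG%s\n' "$(kill -l "$1")"
}

# The three signal numbers that appear in the quoted logs.
for sig in 18 15 9; do
    printf '%s -> %s\n' "$sig" "$(signal_name "$sig")"
done
```
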
> The slurm cluster has a minimal configuration:
>
> ClusterName=cluster
> ControlMachine=master
> SchedulerType=sched/backfill
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core
> FastSchedule=1
> SlurmUser=slurm
> SlurmdUser=root
> SlurmctldPort=6817
> SlurmdPort=6818
> AuthType=auth/munge
> StateSaveLocation=/var/spool/slurm/
> SlurmdSpoolDir=/var/spool/slurm/
> SwitchType=switch/none
> MpiDefault=none
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmdPidFile=/var/run/slurmd.pid
> Proctracktype=proctrack/linuxproc
> ReturnToService=2
> PropagatePrioProcess=0
> PropagateResourceLimitsExcept=MEMLOCK
> TaskPlugin=task/cgroup
> SlurmctldTimeout=300
> SlurmdTimeout=300
> InactiveLimit=0
> MinJobAge=300
> KillWait=30
> Waittime=0
> SlurmctldDebug=4
> SlurmctldLogFile=/var/log/slurmctld.log
> SlurmdDebug=4
> SlurmdLogFile=/var/log/slurmd.log
> JobCompType=jobcomp/none
> JobAcctGatherType=jobacct_gather/cgroup
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStorageHost=master
> AccountingStorageLoc=all
> NodeName=node[01-45] Sockets=2 CoresPerSocket=10 State=UNKNOWN
> PartitionName=partition Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>
> Thank you for your help.
>
>
>
>
