I have got this all wrong. Paddy Doyle has got it right.
However, are you SURE that mpirun is not creating tasks on the other machines?
I would look at the compute nodes while the job is running and do
ps -eaf --forest
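If it helps, here is a rough sketch of how that check could be scripted across several nodes. The function name show_trees and the REMOTE variable are made up for illustration, and it assumes passwordless ssh to the compute nodes and a procps ps that accepts --forest. It narrows the listing to one user with -u rather than showing everything as -e would.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): print one user's process tree
# on each node given on the command line.
# REMOTE is the command used to reach a node; ssh is an assumption here.
REMOTE=${REMOTE:-ssh}

show_trees() {
    local user=$1
    shift
    local node
    for node in "$@"; do
        echo "=== $node ==="
        # -u selects the user's processes, -f gives the full listing,
        # --forest draws the parent/child tree (mpirun -> charmm, etc.)
        $REMOTE "$node" ps -u "$user" -f --forest
    done
}
```

Something like `show_trees someuser node38 node12` would then show whether any mpirun on one node has children on another. The node list could probably be taken from `squeue -h -u someuser -t RUNNING -o %N | sort -u`, which for single-node jobs should print one hostname per job.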
Also, using mpirun to run a single core gives me the heebie-jeebies...
https://en.wikipedia.org/wiki/Heebie-jeebies_(idiom)

On 29 June 2018 at 13:16, Matteo Guglielmi <matteo.guglie...@dalco.ch> wrote:

> You are right but I'm actually supporting the system administrator of that
> cluster, I'll mention this to him.
>
> Besides that,
>
> the user runs this for loop to submit the jobs:
>
> # submit.sh #
>
> typeset -i i=1
> typeset -i j=12500 #number of frames goes to each core = number of frames (1000000)/40 (cores) =
> typeset -i k=1
>
> while [ $i -le 36 ] #the number of frames
> do
>
> sbatch run-5o$i.sh $i $j $k
>
> i=$i+1 # number of frames goes to each node (5*200 = 1000)
> done
>
> where each run-5oXX.sh jobfile looks like this:
>
> #!/bin/bash
>
> #SBATCH --job-name=charmm-test
> #SBATCH --nodes=1
> #SBATCH --ntasks=1
> #SBATCH --cpus-per-task=1
>
> export PATH=/usr/lib64/openmpi/bin/:$PATH
> export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH
>
> mpirun -np 1 /opt/cluster/programs/charmm/c42b2/exec/gnu_M/charmm < newphcnl99a0.inp > newphcnl99a0.out
>
> so they are all independent mpiruns... if one of them is killed, why
> would all others go down as well?
>
> That would make sense if a single mpirun is running 36 tasks... but the
> user is not doing this.
>
> ________________________________
> From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of John Hearns <hear...@googlemail.com>
> Sent: Friday, June 29, 2018 12:52:41 PM
> To: Slurm User Community List
> Subject: Re: [slurm-users] All user's jobs killed at the same time on all nodes
>
> Matteo, a stupid question but if these are single CPU jobs why is mpirun
> being used?
>
> Is your user using these 36 jobs to construct a parallel job to run charmm?
> If the mpirun is killed, yes all the other processes which are started by
> it on the other compute nodes will be killed.
>
> I suspect your user is trying to do something "smart".
> You should give that person an example of how to reserve 36 cores and
> submit a charmm job.
>
> On 29 June 2018 at 12:13, Matteo Guglielmi <matteo.guglie...@dalco.ch> wrote:
>
> Dear community,
>
> I have a user who usually submits 36 (identical) jobs at a time using a
> simple for loop, thus jobs are sbatched all at the same time.
>
> Each job requests a single core and all jobs are independent from one
> another (read different input files and write to different output files).
>
> Jobs are then usually started during the next couple of hours, somewhat at
> random times.
>
> What happens then is that after a certain amount of time (maybe from 2 to
> 12 hours) ALL jobs belonging to this particular user are killed by slurm on
> all nodes at exactly the same time.
>
> One example:
>
> ### master: /var/log/slurmctld.log ###
>
> [2018-06-28T18:43:06.871] _slurm_rpc_submit_batch_job: JobId=718560 InitPrio=4294185624 usec=255
> ...
> [2018-06-28T19:29:04.671] backfill: Started JobID=718560 in partition on node38
> ...
> [2018-06-28T23:37:53.471] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 718560 uid 1007
> [2018-06-28T23:37:53.472] _job_signal: 9 of running JobID=718560 State=0x8004 NodeCnt=1 successful 0x8004
>
> ### node38: /var/log/slurmd.log ###
>
> [2018-06-28T19:29:05.410] _run_prolog: prolog with lock for job 718560 ran for 0 seconds
> [2018-06-28T19:29:05.410] Launching batch job 718560 for UID 1007
> [2018-06-28T19:29:05.427] [718560.batch] Munge cryptographic signature plugin loaded
> [2018-06-28T19:29:05.431] [718560.batch] debug level = 2
> [2018-06-28T19:29:05.431] [718560.batch] starting 1 tasks
> [2018-06-28T19:29:05.431] [718560.batch] task 0 (69791) started 2018-06-28T19:29:05
> [2018-06-28T19:29:05.440] [718560.batch] Can't propagate RLIMIT_NOFILE of 65536 from submit host: Operation not permitted
> ...
> [2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69794 (charmm)
> [2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69792 (mpirun)
> [2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69791 (slurm_script)
> [2018-06-28T23:37:53.480] [718560.batch] Sent signal 18 to 718560.429496729
> [2018-06-28T23:37:53.485] [718560.batch] error: *** JOB 718560 ON node38 CANCELLED AT 2018-06-28T23:37:53 ***
> [2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69794 (charmm)
> [2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69792 (mpirun)
> [2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69791 (slurm_script)
> [2018-06-28T23:37:53.488] [718560.batch] Sent signal 15 to 718560.4294967294
> [2018-06-28T23:37:53.492] [718560.batch] task 0 (69791) exited. Killed by signal 15.
> [2018-06-28T23:37:53.512] [718560.batch] job 718560 completed with slurm_rc = 0, job_rc = 15
> [2018-06-28T23:37:53.512] [718560.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
> [2018-06-28T23:37:53.516] [718560.batch] done with job
>
> The slurm cluster has a minimal configuration:
>
> ClusterName=cluster
> ControlMachine=master
> SchedulerType=sched/backfill
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core
> FastSchedule=1
> SlurmUser=slurm
> SlurmdUser=root
> SlurmctldPort=6817
> SlurmdPort=6818
> AuthType=auth/munge
> StateSaveLocation=/var/spool/slurm/
> SlurmdSpoolDir=/var/spool/slurm/
> SwitchType=switch/none
> MpiDefault=none
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmdPidFile=/var/run/slurmd.pid
> Proctracktype=proctrack/linuxproc
> ReturnToService=2
> PropagatePrioProcess=0
> PropagateResourceLimitsExcept=MEMLOCK
> TaskPlugin=task/cgroup
> SlurmctldTimeout=300
> SlurmdTimeout=300
> InactiveLimit=0
> MinJobAge=300
> KillWait=30
> Waittime=0
> SlurmctldDebug=4
> SlurmctldLogFile=/var/log/slurmctld.log
> SlurmdDebug=4
> SlurmdLogFile=/var/log/slurmd.log
> JobCompType=jobcomp/none
> JobAcctGatherType=jobacct_gather/cgroup
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStorageHost=master
> AccountingStorageLoc=all
> NodeName=node[01-45] Sockets=2 CoresPerSocket=10 State=UNKNOWN
> PartitionName=partition Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>
> Thank you for your help.