Re: [slurm-users] Job step aborted
Excuse me, how can I tell Slurm not to terminate the job step until all of its tasks have finished?

Regards,
Mahmood

On Fri, May 18, 2018 at 10:35 AM, Mahmood Naderan wrote:
> OK, I understand that. However, there is an issue with ntasks=1.
> Assume a user wants to launch an application and pass the number of cores
> as a command-line argument. Keeping in mind that the CPU limit for
> the partition is 20 cores, the following example
>
> [mahmood@rocks7 ~]$ srun --x11 -A y8 -p RUBY --mem=8GB --pty bash
> [mahmood@compute-0-6 ~]$ /state/partition1/scfd/sc -t10
>
> raises two problems:
>
> 1- Slurm assumes that the user's job is using only one core. That means
> a user can create 20 interactive sessions and launch the program with
> 10 threads in each of them, bypassing the core limit I set before.
>
> 2- A user can start the session with ntasks=1 (or without specifying
> it) and then cheat the system by launching the program with more
> threads than the CPU limit (specifying -t50).
>
> Any ideas?
>
> Regards,
> Mahmood
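One option that bears on this question is srun's --wait option, which controls how long srun waits after the first task exits before it terminates the remaining tasks; per the srun man page, a value of 0 means an unlimited wait (a warning is still issued after 60 seconds). The following is only a sketch under that assumption: it prints the command rather than running it, the account, partition, and memory values are copied from the thread, and srun_keep_waiting is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: prints the srun command instead of executing it, so it can be
# tried on a machine without Slurm. srun_keep_waiting is a hypothetical name;
# -A y8, -p RUBY and --mem=8GB are the values used in the thread.
srun_keep_waiting() {
    ntasks="$1"
    # --wait=0 asks srun to wait indefinitely after the first task exits
    # instead of killing the step after the default grace period.
    echo "srun --x11 -A y8 -p RUBY --ntasks=${ntasks} --wait=0 --mem=8GB --pty bash"
}

srun_keep_waiting 10
```

Whether this is the right fix depends on the goal. With --pty, only task 0 runs in the pseudo terminal and the other tasks may exit immediately, so requesting a single task, as suggested elsewhere in the thread, may be the cleaner choice for an interactive shell.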
Re: [slurm-users] Job step aborted
OK, I understand that. However, there is an issue with ntasks=1. Assume a user wants to launch an application and pass the number of cores as a command-line argument. Keeping in mind that the CPU limit for the partition is 20 cores, the following example

[mahmood@rocks7 ~]$ srun --x11 -A y8 -p RUBY --mem=8GB --pty bash
[mahmood@compute-0-6 ~]$ /state/partition1/scfd/sc -t10

raises two problems:

1- Slurm assumes that the user's job is using only one core. That means a user can create 20 interactive sessions and launch the program with 10 threads in each of them, bypassing the core limit I set before.

2- A user can start the session with ntasks=1 (or without specifying it) and then cheat the system by launching the program with more threads than the CPU limit (specifying -t50).

Any ideas?

Regards,
Mahmood

On Thu, May 17, 2018 at 11:40 PM, Matthieu Hautreux wrote:
> It means what is written: your job is terminated because 9 tasks out of 10
> exited more than 60s ago.
>
> The logic behind the 60 seconds (configurable) is described in the srun man
> page. You should look at it closely.
>
> You should also look at the FAQ here: https://slurm.schedmd.com/faq.html.
>
> You should set --ntasks=1, if I have guessed your goal correctly.
>
> HTH
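One way to make Slurm account for the threads of a single-task session is to request them with --cpus-per-task, so that the allocation counts against the partition's core limit. The sketch below assumes that approach; it is not something the thread confirms for this cluster. It prints the command rather than running it, and interactive_cmd is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: prints the srun command instead of executing it.
# interactive_cmd is a hypothetical name; -A y8, -p RUBY and --mem=8GB
# are the values used in the thread.
interactive_cmd() {
    threads="$1"
    # One task (the interactive shell) with ${threads} CPUs reserved for it,
    # so the scheduler charges the job for every core the program will use.
    echo "srun --x11 -A y8 -p RUBY --ntasks=1 --cpus-per-task=${threads} --mem=8GB --pty bash"
}

interactive_cmd 10

# Inside the resulting session, the thread count can follow the allocation
# through the SLURM_CPUS_PER_TASK variable that Slurm sets, for example:
#   /state/partition1/scfd/sc -t"$SLURM_CPUS_PER_TASK"
```

This only covers accounting. Stopping a user from starting more threads than were allocated (the -t50 case) depends on how the cluster confines tasks, for example the task/cgroup plugin with ConstrainCores=yes in cgroup.conf; whether that is enabled here is an assumption, not something stated in the thread.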
Re: [slurm-users] Job step aborted
On Thu, May 17, 2018 at 11:28, Mahmood Naderan wrote:
> Hi,
> For an interactive job via srun, I see that after opening the gui, the
> session is terminated automatically which is weird.
>
> [mahmood@rocks7 ansys_test]$ srun --x11 -A y8 -p RUBY --ntasks=10 --mem=8GB --pty bash
> [mahmood@compute-0-6 ansys_test]$ /state/partition1/scfd/sc -t10
> srun: First task exited 60s ago
> srun: step:292.0 task 0: running
> srun: step:292.0 tasks 1-9: exited
> srun: Terminating job step 292.0
> srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
> srun: error: compute-0-6: task 0: Killed
>
> What does that mean?

It means what is written: your job is terminated because 9 tasks out of 10 exited more than 60s ago.

The logic behind the 60 seconds (configurable) is described in the srun man page. You should look at it closely.

You should also look at the FAQ here: https://slurm.schedmd.com/faq.html.

You should set --ntasks=1, if I have guessed your goal correctly.

HTH

> Regards,
> Mahmood
Re: [slurm-users] Job step aborted
I have opened a bug ticket at https://bugs.schedmd.com/show_bug.cgi?id=5182

It is annoying...

Regards,
Mahmood

On Thu, May 17, 2018 at 1:54 PM, Mahmood Naderan wrote:
> Hi,
> For an interactive job via srun, I see that after opening the gui, the
> session is terminated automatically which is weird.
>
> [mahmood@rocks7 ansys_test]$ srun --x11 -A y8 -p RUBY --ntasks=10 --mem=8GB --pty bash
> [mahmood@compute-0-6 ansys_test]$ /state/partition1/scfd/sc -t10
> srun: First task exited 60s ago
> srun: step:292.0 task 0: running
> srun: step:292.0 tasks 1-9: exited
> srun: Terminating job step 292.0
> srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
> srun: error: compute-0-6: task 0: Killed
>
> What does that mean?
>
> Regards,
> Mahmood
[slurm-users] Job step aborted
Hi,
For an interactive job via srun, I see that after opening the gui, the session is terminated automatically which is weird.

[mahmood@rocks7 ansys_test]$ srun --x11 -A y8 -p RUBY --ntasks=10 --mem=8GB --pty bash
[mahmood@compute-0-6 ansys_test]$ /state/partition1/scfd/sc -t10
srun: First task exited 60s ago
srun: step:292.0 task 0: running
srun: step:292.0 tasks 1-9: exited
srun: Terminating job step 292.0
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
srun: error: compute-0-6: task 0: Killed

What does that mean?

Regards,
Mahmood