Re: [slurm-users] Job step aborted

2018-05-19 Thread Mahmood Naderan
Excuse me, how can I tell Slurm not to terminate the job until all
steps (tasks) have finished?

Regards,
Mahmood







Re: [slurm-users] Job step aborted

2018-05-17 Thread Mahmood Naderan
OK, I understand that. However, there is an issue with ntasks=1.
Assume a user wants to launch an application that takes the number of
cores as a command-line argument. Bearing in mind that the CPU limit for
the partition is 20 cores, the following example

[mahmood@rocks7 ~]$ srun --x11 -A y8 -p RUBY --mem=8GB --pty bash
[mahmood@compute-0-6 ~]$ /state/partition1/scfd/sc -t10

raises two problems:
1- Slurm assumes that the user's job is using only one core. That means
a user can create 20 interactive sessions and, in each of them, launch
the program with 10 threads, bypassing the core limit I set
before.

2- A user can start the session with ntasks=1 (or without specifying
it) and then cheat the system by launching the program with more
threads than the CPU limit (specifying -t50).

Any idea?
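
One cooperative approach to problem 2, sketched here as an assumption and
not something confirmed in this thread: derive the thread count from the
allocation instead of typing it by hand. SLURM_CPUS_PER_TASK is set by Slurm
only when --cpus-per-task is given; the helper name clamp_threads is
hypothetical.

```shell
# Return the requested thread count, capped at what the allocation allows.
# Falls back to 1 CPU when SLURM_CPUS_PER_TASK is unset.
clamp_threads() {
    requested=$1
    allowed=${SLURM_CPUS_PER_TASK:-1}
    if [ "$requested" -gt "$allowed" ]; then
        echo "$allowed"
    else
        echo "$requested"
    fi
}

# Inside the interactive session one could then run, for example:
#   /state/partition1/scfd/sc -t"$(clamp_threads 50)"
```

This only helps users who go through the wrapper; it does not by itself
prevent anyone from passing a larger -t value directly, so real enforcement
would still have to happen on the Slurm side.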



Regards,
Mahmood







Re: [slurm-users] Job step aborted

2018-05-17 Thread Matthieu Hautreux
On Thu, May 17, 2018, 11:28, Mahmood Naderan wrote:

> Hi,
> For an interactive job via srun, I see that after opening the GUI, the
> session is terminated automatically, which is weird.
>
> [mahmood@rocks7 ansys_test]$ srun --x11 -A y8 -p RUBY --ntasks=10 --mem=8GB --pty bash
> [mahmood@compute-0-6 ansys_test]$ /state/partition1/scfd/sc -t10
> srun: First task exited 60s ago
> srun: step:292.0 task 0: running
> srun: step:292.0 tasks 1-9: exited
> srun: Terminating job step 292.0
> srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
> srun: error: compute-0-6: task 0: Killed
>
> What does that mean?
>

It means what it says: your job step was terminated because 9 of its 10 tasks
exited more than 60 seconds earlier.

The logic behind the 60 seconds (configurable) is described in the srun man
page. You should look at it closely.
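
As a sketch of the configurable wait: the srun man page documents
-W, --wait=<seconds>, where a value of 0 means an unlimited wait after the
first task exits, with the default taken from WaitTime in slurm.conf. Whether
this is the right fix for the session above is an assumption; the command
below is the original one with that single option added.

```shell
# Sketch only: the original srun options plus --wait, so that srun does not
# terminate the step a fixed time after the first task exits.
wait_seconds=0   # 0 = unlimited wait, per the srun man page
cmd="srun --x11 -A y8 -p RUBY --ntasks=10 --wait=${wait_seconds} --mem=8GB --pty bash"
echo "$cmd"      # print the command for review before running it
```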

You should also look at the FAQ here https://slurm.schedmd.com/faq.html.

You should set --ntasks=1, if I am guessing your goal correctly.
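
For illustration, and assuming the goal is a single interactive shell that
runs one multithreaded program, the request could look like the sketch below.
Adding --cpus-per-task is my assumption (it is a standard srun option, but it
was not part of the original command); the value 10 simply mirrors the -t10
passed to the application.

```shell
# Sketch only: one task (one interactive shell) that is allocated 10 CPUs,
# instead of 10 tasks of which 9 exit immediately.
cpus=10
cmd="srun --x11 -A y8 -p RUBY --ntasks=1 --cpus-per-task=${cpus} --mem=8GB --pty bash"
echo "$cmd"      # print the command for review before running it
```

With a single task there are no sibling tasks that can exit early, so the
"First task exited" timeout should not come into play.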

HTH





Re: [slurm-users] Job step aborted

2018-05-17 Thread Mahmood Naderan
I have opened a bug ticket at https://bugs.schedmd.com/show_bug.cgi?id=5182
It is annoying...

Regards,
Mahmood







[slurm-users] Job step aborted

2018-05-17 Thread Mahmood Naderan
Hi,
For an interactive job via srun, I see that after opening the GUI, the
session is terminated automatically, which is weird.

[mahmood@rocks7 ansys_test]$ srun --x11 -A y8 -p RUBY --ntasks=10 --mem=8GB --pty bash
[mahmood@compute-0-6 ansys_test]$ /state/partition1/scfd/sc -t10
srun: First task exited 60s ago
srun: step:292.0 task 0: running
srun: step:292.0 tasks 1-9: exited
srun: Terminating job step 292.0
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
srun: error: compute-0-6: task 0: Killed

What does that mean?

Regards,
Mahmood