Re: [slurm-users] is there a way to delay the scheduling.

2020-08-28 Thread Ryan Novosielski
Sounds like you’re sort of the poster-child for this section of the 
documentation:

https://slurm.schedmd.com/high_throughput.html — note that it's possible for 
this to be version-specific, so look for this file in the “archive” section of 
the website if you need a version other than 20.02.
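
For what it's worth, a rough sketch of the kind of slurm.conf tuning that guide
covers might look like this (the parameter names are real, but the values are
purely illustrative, not a recommendation for your site):

    # slurm.conf (illustrative values; see high_throughput.html for your version)
    SchedulerParameters=defer,batch_sched_delay=20,sched_min_interval=2000000,max_rpc_cnt=150
    MessageTimeout=30
    MinJobAge=300

The right knobs and values vary between Slurm versions and workloads, so treat
that as a pointer into the page above rather than something to paste in as-is.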

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'

> On Aug 28, 2020, at 6:30 AM, navin srivastava  wrote:
> 
> Hi Team,
> 
> We are facing one issue. Several users are submitting 2 jobs in a single 
> batch job, and these are very short jobs (say 1-2 sec). While more jobs are 
> being submitted, slurmctld becomes unresponsive and starts giving messages like:
> 
> Sending job 6e508a88155d9bec40d752c8331d7ae8 to queue.
> sbatch: error: Batch job submission failed: Unable to contact slurm 
> controller (connect failure)
> Sending job 6e51ed0e322c87802b0f3a2f23a7967f to queue.
> sbatch: error: Batch job submission failed: Unable to contact slurm 
> controller (connect failure)
> Sending job 6e638939f90cd59e60c23b8450af9839 to queue.
> sbatch: error: Batch job submission failed: Unable to contact slurm 
> controller (connect failure)
> Sending job 6e6acf36bc7e1394a92155a95feb1c92 to queue.
> sbatch: error: Batch job submission failed: Unable to contact slurm 
> controller (connect failure)
> Sending job 6e6c646a29f0ad4e9df35001c367a9f5 to queue.
> sbatch: error: Batch job submission failed: Unable to contact slurm 
> controller (connect failure)
> Sending job 6ebcecb4c27d88f0f48d402e2b079c52 to queue.
> 
> During that time, the slurmctld process is consuming more than 100% CPU.
> I found that the node is not able to acknowledge to the server immediately; 
> it is slow to move from comp (completing) to idle.
> My thought is that delaying the scheduling cycle would help here. Any idea 
> how that can be done?
> 
> Also, is there any other solution available for such issues?
> 
> Regards
> Navin.
> 
> 
> 



Re: [slurm-users] is there a way to delay the scheduling.

2020-08-28 Thread Maciej Pawlik
Hey,

You can use the 'defer' scheduler parameter (see
https://slurm.schedmd.com/sched_config.html) if you don't require an immediate
start of jobs.
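
For example, a minimal sketch (assuming no other SchedulerParameters are set
yet) would be a line like this in slurm.conf on the controller:

    # defer: slurmctld will not attempt to schedule each job individually at
    # submit time, but leaves it to the periodic scheduling pass
    SchedulerParameters=defer

After changing it, I believe "scontrol reconfigure" on the controller is enough
to pick it up, but double-check that for your version. If you already have
SchedulerParameters set, defer is simply appended to the existing
comma-separated list.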

best regards
Maciej Pawlik

On Fri, Aug 28, 2020 at 12:32, navin srivastava 
wrote:

> Hi Team,
>
> We are facing one issue. Several users are submitting 2 jobs in a single
> batch job, and these are very short jobs (say 1-2 sec). While more jobs are
> being submitted, slurmctld becomes unresponsive and starts giving messages like:
>
> Sending job 6e508a88155d9bec40d752c8331d7ae8 to queue.
> sbatch: error: Batch job submission failed: Unable to contact slurm
> controller (connect failure)
> Sending job 6e51ed0e322c87802b0f3a2f23a7967f to queue.
> sbatch: error: Batch job submission failed: Unable to contact slurm
> controller (connect failure)
> Sending job 6e638939f90cd59e60c23b8450af9839 to queue.
> sbatch: error: Batch job submission failed: Unable to contact slurm
> controller (connect failure)
> Sending job 6e6acf36bc7e1394a92155a95feb1c92 to queue.
> sbatch: error: Batch job submission failed: Unable to contact slurm
> controller (connect failure)
> Sending job 6e6c646a29f0ad4e9df35001c367a9f5 to queue.
> sbatch: error: Batch job submission failed: Unable to contact slurm
> controller (connect failure)
> Sending job 6ebcecb4c27d88f0f48d402e2b079c52 to queue.
>
> During that time, the slurmctld process is consuming more than 100% CPU.
> I found that the node is not able to acknowledge to the server immediately;
> it is slow to move from comp (completing) to idle.
> My thought is that delaying the scheduling cycle would help here. Any idea
> how that can be done?
>
> Also, is there any other solution available for such issues?
>
> Regards
> Navin.
>
>
>
>


Re: [slurm-users] is there a way to delay the scheduling.

2020-08-28 Thread Brian Andrus
Seems that if they are really that short, it would be better to have a single 
job run through them all, or have, say, 10 jobs run through 2000 each.

Such short jobs take more time for setup/teardown than the work itself, which 
makes submitting each one as its own job inefficient. The resources used just 
to schedule them that way far outweigh the resources the jobs themselves need.
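
Something along these lines, for example: a rough, untested sketch, where
commands.txt is just a made-up file with one short command per line. A single
submitted job churns through the whole list instead of one sbatch call per
command:

    #!/bin/bash
    #SBATCH --job-name=short_task_pack
    #SBATCH --ntasks=1
    #SBATCH --time=01:00:00

    # run every short command from the list inside this single allocation
    while IFS= read -r cmd; do
        bash -c "$cmd"
    done < commands.txt

Split commands.txt into, say, 10 chunks and submit one such job per chunk, and
you get the "10 jobs run through 2000 each" version with a tiny fraction of the
submission and scheduling load on slurmctld.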


Brian Andrus

On 8/28/2020 3:30 AM, navin srivastava wrote:

Hi Team,

We are facing one issue. Several users are submitting 2 jobs in a single 
batch job, and these are very short jobs (say 1-2 sec). While more jobs are 
being submitted, slurmctld becomes unresponsive and starts giving messages like:


Sending job 6e508a88155d9bec40d752c8331d7ae8 to queue.
sbatch: error: Batch job submission failed: Unable to contact slurm 
controller (connect failure)

Sending job 6e51ed0e322c87802b0f3a2f23a7967f to queue.
sbatch: error: Batch job submission failed: Unable to contact slurm 
controller (connect failure)

Sending job 6e638939f90cd59e60c23b8450af9839 to queue.
sbatch: error: Batch job submission failed: Unable to contact slurm 
controller (connect failure)

Sending job 6e6acf36bc7e1394a92155a95feb1c92 to queue.
sbatch: error: Batch job submission failed: Unable to contact slurm 
controller (connect failure)

Sending job 6e6c646a29f0ad4e9df35001c367a9f5 to queue.
sbatch: error: Batch job submission failed: Unable to contact slurm 
controller (connect failure)

Sending job 6ebcecb4c27d88f0f48d402e2b079c52 to queue.

During that time, the slurmctld process is consuming more than 100% CPU.
I found that the node is not able to acknowledge to the server immediately; 
it is slow to move from comp (completing) to idle.
My thought is that delaying the scheduling cycle would help here. Any idea how 
that can be done?


Also, is there any other solution available for such issues?

Regards
Navin.





[slurm-users] is there a way to delay the scheduling.

2020-08-28 Thread navin srivastava
Hi Team,

We are facing one issue. Several users are submitting 2 jobs in a single batch
job, and these are very short jobs (say 1-2 sec). While more jobs are being
submitted, slurmctld becomes unresponsive and starts giving messages like:

Sending job 6e508a88155d9bec40d752c8331d7ae8 to queue.
sbatch: error: Batch job submission failed: Unable to contact slurm
controller (connect failure)
Sending job 6e51ed0e322c87802b0f3a2f23a7967f to queue.
sbatch: error: Batch job submission failed: Unable to contact slurm
controller (connect failure)
Sending job 6e638939f90cd59e60c23b8450af9839 to queue.
sbatch: error: Batch job submission failed: Unable to contact slurm
controller (connect failure)
Sending job 6e6acf36bc7e1394a92155a95feb1c92 to queue.
sbatch: error: Batch job submission failed: Unable to contact slurm
controller (connect failure)
Sending job 6e6c646a29f0ad4e9df35001c367a9f5 to queue.
sbatch: error: Batch job submission failed: Unable to contact slurm
controller (connect failure)
Sending job 6ebcecb4c27d88f0f48d402e2b079c52 to queue.

During that time, the slurmctld process is consuming more than 100% CPU.
I found that the node is not able to acknowledge to the server immediately; it
is slow to move from comp (completing) to idle.
My thought is that delaying the scheduling cycle would help here. Any idea how
that can be done?

Also, is there any other solution available for such issues?

Regards
Navin.