[slurm-users] Slurm User Group Meeting (SLUG'20) Agenda Posted
The Slurm User Group Meeting (SLUG'20) this fall will be moving online. In lieu of an in-person meeting, SchedMD will broadcast a select set of presentations on Tuesday, September 15th, 2020, from 9am to noon (MDT).

The agenda is now posted online at: https://slurm.schedmd.com/slurm_ug_agenda.html

Links to the broadcasts will be added there when available, and an update will be sent to the slurm-announce and slurm-users lists.

- Tim

--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support
Re: [slurm-users] Alternatives for MailProg
That is where you have MailProg call a bash script: point it at a wrapper, and within that script you add whatever you need, like Ahmet's suggested script. So use his as a template and add the headers you desire.

Brian Andrus

On 8/28/2020 11:36 AM, Chris Samuel wrote:
> On 8/27/20 3:42 pm, Brian Andrus wrote:
>> Actually, you can add headers of all kinds. A quick search of
>> "sendmail add headers" turns up examples.
>
> The problem is that Slurm doesn't call sendmail directly; it calls
> "mail" (or whatever MailProg is set to in your slurm.conf) instead,
> hence not being able to add headers.
>
> All the best,
> Chris
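A minimal sketch of such a wrapper, assuming Slurm invokes MailProg the same way it invokes /bin/mail (i.e. `wrapper -s "subject" recipient` with the body on stdin); the From: and X-Slurm-Cluster: values below are hypothetical placeholders to replace with your own:

```shell
#!/bin/bash
# Sketch of a MailProg wrapper script that injects extra headers,
# assuming the /bin/mail calling convention: -s "subject" recipient,
# with the message body arriving on stdin.

build_message() {   # $1 = recipient, $2 = subject; body read from stdin
    printf 'To: %s\n' "$1"
    printf 'Subject: %s\n' "$2"
    printf 'From: slurm@cluster.example.com\n'   # assumed sender address
    printf 'X-Slurm-Cluster: mycluster\n'        # example custom header
    printf '\n'
    cat
}

subject=""
while getopts "s:" opt; do
    [ "$opt" = "s" ] && subject="$OPTARG"
done
shift $((OPTIND - 1))

# Hand the finished message (headers included) to sendmail; -t tells
# sendmail to take the recipient from the To: header we just added.
if [ -n "$1" ]; then
    build_message "$1" "$subject" | /usr/sbin/sendmail -t
fi
```

Set `MailProg=/path/to/this/wrapper` in slurm.conf and add as many headers inside build_message as you like.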
Re: [slurm-users] Alternatives for MailProg
On 8/27/20 3:42 pm, Brian Andrus wrote:
> Actually, you can add headers of all kinds:
> Quick search of "sendmail add headers" discovers:

The problem is that Slurm doesn't call sendmail directly; it calls "mail" (or MailProg in your slurm.conf) instead, hence not being able to add headers.

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] is there a way to delay the scheduling.
Sounds like you're sort of the poster child for this section of the documentation: https://slurm.schedmd.com/high_throughput.html

Note that this advice may be version-specific, so look for this file in the "archive" section of the website if you need a version other than 20.02.

--
Ryan Novosielski - novos...@rutgers.edu
Sr. Technologist - 973/972.0922 (2x0922)
Office of Advanced Research Computing - MSB C630, Newark
Rutgers, the State University of NJ - RBHS Campus

> On Aug 28, 2020, at 6:30 AM, navin srivastava wrote:
>
> Hi Team,
>
> facing one issue. several users submitting 2 job in a single batch job
> which is very short jobs( says 1-2 sec). so while submitting more job
> slurmctld become unresponsive and started giving message
> [...]
>
> Regards
> Navin.
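For reference, most of that guide's advice comes down to slurm.conf tuning. A hypothetical fragment of the kind it suggests (the parameter values here are illustrative only and must be tuned per site):

```
# Throttle scheduling work so slurmctld can keep servicing RPCs:
SchedulerParameters=batch_sched_delay=10,sched_min_interval=50000,max_rpc_cnt=150
# Give slow RPCs longer before clients report "connect failure":
MessageTimeout=30
# Keep completed-job records around briefly to reduce churn:
MinJobAge=300
```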
Re: [slurm-users] is there a way to delay the scheduling.
Hey,

you can use the 'defer' scheduler parameter if you don't require immediate start of jobs: https://slurm.schedmd.com/sched_config.html

best regards
Maciej Pawlik

On Fri, Aug 28, 2020 at 12:32, navin srivastava wrote:
> Hi Team,
>
> facing one issue. several users submitting 2 job in a single batch job
> which is very short jobs( says 1-2 sec). so while submitting more job
> slurmctld become unresponsive and started giving message
> [...]
>
> Regards
> Navin.
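In slurm.conf that looks like the fragment below (a sketch): defer tells slurmctld to skip the attempt to schedule each job individually at submit time and to rely on the periodic scheduling passes instead, which reduces per-submission work under a flood of short jobs.

```
SchedulerParameters=defer
```

After editing slurm.conf, apply the change with `scontrol reconfigure`.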
Re: [slurm-users] is there a way to delay the scheduling.
Seems if they are really that short, it would be better to have a single job run through them all, or ten jobs running through 2000 items each, that kind of thing.

Such short jobs take more time for setup/teardown than the job itself, making this approach inefficient. The resources used just to schedule them in that fashion far outweigh the resources the work itself needs.

Brian Andrus

On 8/28/2020 3:30 AM, navin srivastava wrote:
> Hi Team,
>
> facing one issue. several users submitting 2 job in a single batch job
> which is very short jobs( says 1-2 sec). so while submitting more job
> slurmctld become unresponsive and started giving message
> [...]
>
> Regards
> Navin.
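A sketch of that batching approach: one sbatch script runs all the short work items inside a single allocation, so slurmctld schedules one job instead of thousands. TASK_CMD stands in for the real 1-2 second command (here it defaults to a no-op so the sketch is runnable):

```shell
#!/bin/bash
#SBATCH --job-name=short-tasks
#SBATCH --ntasks=1
#SBATCH --time=01:00:00

# Run all 2000 short work items sequentially in one allocation instead
# of submitting 2000 separate jobs. TASK_CMD is a hypothetical
# placeholder for the real per-item command.
TASK_CMD="${TASK_CMD:-true}"

completed=0
for i in $(seq 1 2000); do
    "$TASK_CMD" "$i" && completed=$((completed + 1))
done
echo "completed ${completed} tasks"
```

Splitting the item range across ten such jobs (200 each) gives some parallelism while still keeping the job count, and thus the scheduling overhead, tiny.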
[slurm-users] is there a way to delay the scheduling.
Hi Team,

We are facing an issue: several users are submitting batches of very short jobs (say 1-2 sec each). While more jobs are being submitted, slurmctld becomes unresponsive and starts giving messages such as:

Sending job 6e508a88155d9bec40d752c8331d7ae8 to queue.
sbatch: error: Batch job submission failed: Unable to contact slurm controller (connect failure)
Sending job 6e51ed0e322c87802b0f3a2f23a7967f to queue.
sbatch: error: Batch job submission failed: Unable to contact slurm controller (connect failure)
Sending job 6e638939f90cd59e60c23b8450af9839 to queue.
sbatch: error: Batch job submission failed: Unable to contact slurm controller (connect failure)
Sending job 6e6acf36bc7e1394a92155a95feb1c92 to queue.
sbatch: error: Batch job submission failed: Unable to contact slurm controller (connect failure)
Sending job 6e6c646a29f0ad4e9df35001c367a9f5 to queue.
sbatch: error: Batch job submission failed: Unable to contact slurm controller (connect failure)
Sending job 6ebcecb4c27d88f0f48d402e2b079c52 to queue.

During that time the slurmctld process was consuming more than 100% CPU. I also found that the nodes are not able to acknowledge to the server immediately; they are slow to move from comp to idle.

My thought is that delaying the scheduling cycle would help here. Any idea how that can be done? Or is there any other solution available for such issues?

Regards
Navin.