[slurm-dev] KNL support in Slurm version 16.05.6

2016-10-27 Thread Morris Jette
Slurm version 16.05.6 also includes a new node_features/knl_generic plugin, which can allow regular users to modify NUMA and MCDRAM modes of KNL nodes. For more information see: http://slurm.schedmd.com/intel_knl.html On 2016-10-27 16:36, Danny Auble wrote: Slurm version 16.05.6 is now

[slurm-dev] Re: How to restart a job "(launch failed requeued held)"

2016-10-27 Thread Lachlan Musicman
On 28 October 2016 at 09:20, Christopher Samuel wrote: > > On 28/10/16 08:44, Lachlan Musicman wrote: > > > So I checked the system, noticed that one node was drained, resumed it. > > Then I tried both > > > > scontrol requeue 230591 > > scontrol resume 230591 > > What

[slurm-dev] Slurm versions 16.05.6 and 17.02.0-pre3 are now available

2016-10-27 Thread Danny Auble
Slurm version 16.05.6 is now available and includes around 40 bug fixes developed over the past month. We have also made the third pre-release of version 17.02, which is under development and scheduled for release in February 2017. Slurm downloads are available from

[slurm-dev] Re: How to restart a job "(launch failed requeued held)"

2016-10-27 Thread Christopher Samuel
On 28/10/16 08:44, Lachlan Musicman wrote: > So I checked the system, noticed that one node was drained, resumed it. > Then I tried both > > scontrol requeue 230591 > scontrol resume 230591 What happens if you "scontrol hold" it first before "scontrol release"'ing it? -- Christopher Samuel
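
A minimal sketch of the hold-then-release sequence asked about above, not a confirmed fix for the "(launch failed requeued held)" state. It only builds the two scontrol command lines for a job ID (230591 is the ID quoted in the thread); the helper name hold_release_commands is hypothetical, and nothing is executed against a cluster.

```python
import shlex
from typing import List


def hold_release_commands(job_id: int) -> List[List[str]]:
    """Return the scontrol invocations in order: hold first, then release.

    Hypothetical helper for illustration; it does not run anything.
    """
    if job_id <= 0:
        raise ValueError("job_id must be a positive integer")
    jid = str(job_id)
    return [
        ["scontrol", "hold", jid],     # put the job into a held state
        ["scontrol", "release", jid],  # then release the hold
    ]


if __name__ == "__main__":
    # Print the commands for review instead of running them.
    for cmd in hold_release_commands(230591):
        print(shlex.join(cmd))
```

To actually run the commands on a submit host, each list could be passed to subprocess.run(cmd, check=True); whether that clears the held state is exactly the open question in the thread.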

[slurm-dev] How to restart a job "(launch failed requeued held)"

2016-10-27 Thread Lachlan Musicman
Morning, Yesterday we had some internal network issues that caused havoc on our system. By the end of the day everything was ok on the whole. This morning I came in to see one job on the queue (which was otherwise relatively quiet) with the error message/Nodelist Reason (launch failed requeued

[slurm-dev] Re: Requirement of no firewall on compute nodes?

2016-10-27 Thread Christopher Benjamin Coffey
Hi Ole, I don’t see a reason for a firewall to exist on a compute node, is it a requirement on your new cluster? If not, disable it. I don’t see Moe’s statement as saying that you can’t have a firewall, just that if there is one, you should open it up to allow all slurm communication. Best,

[slurm-dev] Re: slurm network address problem ?

2016-10-27 Thread Ole Holm Nielsen
You might want to check out my Wiki-page for setting up Slurm on CentOS 7.2: https://wiki.fysik.dtu.dk/niflheim/SLURM. Perhaps you'll solve the problem using this information? On 10/27/2016 04:14 PM, Mikhail Kuzminsky wrote: I worked w/PBS and SGE; now I'm beginner w/slurm, and installed

[slurm-dev] Requirement of no firewall on compute nodes?

2016-10-27 Thread Ole Holm Nielsen
In the process of developing our new cluster using Slurm, I've been bitten by the firewall settings on the compute nodes preventing MPI jobs from spawning tasks on remote nodes. I now believe that Slurm actually has a requirement that compute nodes must have their Linux firewall disabled.

[slurm-dev] Re: Impact to jobs when reconfiguring partitions?

2016-10-27 Thread Ole Holm Nielsen
On 10/27/2016 09:42 AM, Loris Bennett wrote: So is restarting slurmctld the only way to let it pick up changes in slurm.conf? No. You can also run "scontrol reconfigure". This does not restart slurmctld. Question: How are the slurmd daemons notified about the changes in slurm.conf? Will

[slurm-dev] Re: Impact to jobs when reconfiguring partitions?

2016-10-27 Thread Loris Bennett
Tuo Chen Peng writes: > I thought the ‘scontrol update’ command is for letting slurmctld pick up any > change in slurm.conf. > > But after reading the manual again, it seems this command is for changing > settings at runtime, rather than reading any change from

[slurm-dev] Re: Slurm license management question

2016-10-27 Thread Loris Bennett
Baker D.J. writes: > Hello, > > Looking at the Slurm documentation I see that it is possible to handle basic > license management (this is the link http://slurm.schedmd.com/licenses.html). > In > other words software licenses can be treated as a resource, however things

[slurm-dev] Re: Set Limit Time Per Job

2016-10-27 Thread Achi Hamza
Hi Benjamin, Thank you for your response. In fact, I forgot that I had set OverTimeLimit to 10 min, which is how long a job can exceed its time limit before being canceled. That is why the job ran beyond the time limit. Thank you again and regards, Hamza On 26 October 2016 at 23:11, Benjamin Redling
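
As a rough illustration of the arithmetic described above: with OverTimeLimit set, a job is not canceled at its time limit but only after the extra allowance has also elapsed. The sketch below assumes both values are given in whole minutes; the function name and the 30-minute job limit are made up for the example, and only the 10-minute OverTimeLimit comes from the message.

```python
def minutes_until_cancel(time_limit_min: int, over_time_limit_min: int) -> int:
    """Earliest runtime (in minutes) at which the job becomes eligible for cancellation.

    Hypothetical helper: adds the OverTimeLimit allowance to the job's time limit.
    """
    if time_limit_min < 0 or over_time_limit_min < 0:
        raise ValueError("limits must be non-negative")
    return time_limit_min + over_time_limit_min


if __name__ == "__main__":
    job_limit = 30    # assumed job time limit, not taken from the thread
    over_limit = 10   # OverTimeLimit value mentioned in the message
    print(minutes_until_cancel(job_limit, over_limit))
```

With OverTimeLimit at 0, the result is simply the job's own time limit.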