[slurm-dev] Re: Running jobs are stopped and requeued when adding new nodes
Hi Jin,

I think that I always do your steps 3 and 4 in the opposite order: restart slurmctld first, then slurmd on the nodes:

> 3. Restart the slurmd on all nodes
> 4. Restart the slurmctld

Since you run the very old Slurm 15.08, perhaps you should upgrade 15.08 -> 16.05 -> 17.02. Soon there will be a 17.11.

FYI: I wrote some notes about upgrading:
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm

/Ole

On 10/23/2017 02:55 PM, JinSung Kang wrote:
> Hi,
>
> Thanks everyone for your responses. I have also tested removing nodes
> from the cluster, and the same thing happens.
>
> *To answer some of the previous questions:*
>
> The "Node compute004 appears to have a different slurm.conf than the
> slurmctld" error comes up when I replace slurm.conf on all the machines,
> but it goes away when I restart slurmctld.
>
> The Slurm version I'm running is 15.08.7.
>
> I've included the slurm.conf rather than slurmdbd.conf.
>
> Cheers,
> Jin
>
> On Mon, Oct 23, 2017 at 8:25 AM Ole Holm Nielsen wrote:
>> Hi Jin,
>>
>> Your slurmctld.log says "Node compute004 appears to have a different
>> slurm.conf than the slurmctld" etc. This will happen if you didn't copy
>> the slurm.conf correctly to the nodes. Please correct this potential
>> error.
>>
>> Also, please specify which version of Slurm you're running.
>>
>> /Ole
>>
>> On 10/22/2017 08:44 PM, JinSung Kang wrote:
>>> I am having trouble with adding new nodes to a Slurm cluster without
>>> killing the jobs that are currently running.
>>>
>>> Right now I:
>>>
>>> 1. Update the slurm.conf and add a new node to it
>>> 2. Copy the new slurm.conf to all the nodes
>>> 3. Restart the slurmd on all nodes
>>> 4. Restart the slurmctld
>>>
>>> But when I restart slurmctld, all the jobs that were running are
>>> requeued, with "(Begin Time)" as the reason for not running. The newly
>>> added node works perfectly fine.
>>>
>>> I've included the slurm.conf. I've also included the slurmctld.log
>>> output from when I'm trying to add the new node.
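The order Ole recommends (controller first, then the node daemons) can be sketched as a small shell helper. This is only a sketch, not from the thread: the systemd unit names, passwordless ssh, and the compute node names are all assumptions, and the `DRY_RUN` guard (on by default) only prints what would run.

```shell
#!/bin/sh
# Sketch of the restart order Ole suggests: slurmctld first, then slurmd.
# Assumptions (not from the thread): systemd units named slurmctld/slurmd,
# passwordless ssh to the compute nodes, hypothetical node names.
DRY_RUN=${DRY_RUN:-1}

run() {
    # With DRY_RUN=1, only print the command instead of executing it.
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

NODES="compute001 compute002 compute003 compute004"

# 1. Restart the controller so it picks up the new slurm.conf first.
run systemctl restart slurmctld

# 2. Then restart slurmd on every node.
for node in $NODES; do
    run ssh "$node" systemctl restart slurmd
done
```

With the default `DRY_RUN=1` the script only prints the commands, so the ordering can be reviewed before touching a production cluster.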
[slurm-dev] Re: Running jobs are stopped and requeued when adding new nodes
Hi,

Thanks everyone for your responses. I have also tested removing nodes from the cluster, and the same thing happens.

*To answer some of the previous questions:*

The "Node compute004 appears to have a different slurm.conf than the slurmctld" error comes up when I replace slurm.conf on all the machines, but it goes away when I restart slurmctld.

The Slurm version I'm running is 15.08.7.

I've included the slurm.conf rather than slurmdbd.conf.

Cheers,
Jin

On Mon, Oct 23, 2017 at 8:25 AM Ole Holm Nielsen wrote:
> Hi Jin,
>
> Your slurmctld.log says "Node compute004 appears to have a different
> slurm.conf than the slurmctld" etc. This will happen if you didn't copy
> the slurm.conf correctly to the nodes. Please correct this potential
> error.
>
> Also, please specify which version of Slurm you're running.
>
> /Ole
>
> On 10/22/2017 08:44 PM, JinSung Kang wrote:
>> I am having trouble with adding new nodes to a Slurm cluster without
>> killing the jobs that are currently running.
>>
>> Right now I:
>>
>> 1. Update the slurm.conf and add a new node to it
>> 2. Copy the new slurm.conf to all the nodes
>> 3. Restart the slurmd on all nodes
>> 4. Restart the slurmctld
>>
>> But when I restart slurmctld, all the jobs that were running are
>> requeued, with "(Begin Time)" as the reason for not running. The newly
>> added node works perfectly fine.
>>
>> I've included the slurm.conf. I've also included the slurmctld.log
>> output from when I'm trying to add the new node.
[slurm-dev] Re: Running jobs are stopped and requeued when adding new nodes
Hi Jin,

Your slurmctld.log says "Node compute004 appears to have a different slurm.conf than the slurmctld" etc. This will happen if you didn't copy correctly the slurm.conf to the nodes. Please correct this potential error.

Also, please specify which version of Slurm you're running.

/Ole

On 10/22/2017 08:44 PM, JinSung Kang wrote:
> I am having trouble with adding new nodes into slurm cluster without
> killing the jobs that are currently running.
>
> Right now I
>
> 1. Update the slurm.conf and add a new node to it
> 2. Copy new slurm.conf to all the nodes,
> 3. Restart the slurmd on all nodes
> 4. Restart the slurmctld
>
> But when I restart slurmctld all the jobs that were currently running
> are requeued (Begin Time) as reason for not running. The new added node
> works perfectly fine.
>
> I've included the slurm.conf. I've also included slurmctld.log output
> when I'm trying to add the new node.
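The "different slurm.conf" warning usually means the copy step silently failed on some node. A hedged sketch for checking that every node holds the same file as the controller (node names and the `ssh` access are assumptions, nothing here is from the thread):

```shell
#!/bin/sh
# Sketch: compare slurm.conf checksums between the controller and each
# node to find a stale copy. Hostnames and paths are assumptions.
CONF=${CONF:-/etc/slurm/slurm.conf}
NODES=${NODES:-"compute001 compute002 compute003 compute004"}

sum_of() {
    # Print the md5 checksum of a file, or nothing if it is unreadable.
    md5sum "$1" 2>/dev/null | awk '{print $1}'
}

check_all() {
    local_sum=$(sum_of "$CONF")
    for node in $NODES; do
        remote_sum=$(ssh "$node" "md5sum $CONF" 2>/dev/null | awk '{print $1}')
        if [ "$remote_sum" = "$local_sum" ]; then
            echo "$node: OK"
        else
            echo "$node: MISMATCH"
        fi
    done
}

# Only contact the nodes when explicitly asked, e.g.: ./check.sh --run
if [ "${1:-}" = "--run" ]; then
    check_all
fi
```

Any node reporting MISMATCH is a candidate for the slurmctld warning Ole quotes above.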
[slurm-dev] Re: Running jobs are stopped and requeued when adding new nodes
Ole Holm Nielsen writes:

> I have added nodes to an existing partition several times using the same
> procedure which you describe, and no bad side effects have been noticed.
> This is a very normal kind of operation in a cluster, where hardware may
> be added or retired from time to time, while the cluster of course
> continues its normal production. We must be able to do this, especially
> when transferring existing nodes into a new Slurm cluster.

I too have done the same a lot of times, and never seen any problem like this.

> Douglas Jacobsen explained very well why problems may arise. It seems to
> me that this completely rigid nodelist bit mask in the network is a Slurm
> design problem, and that it ought to be fixed.

The bitmask design is for speed, and given the problem of getting the backfiller to be fast enough under certain loads (lots of small, distributed jobs running, and a long queue of pending jobs), I personally wouldn't want SchedMD to sacrifice that to make updates of node lists easier. Especially since I haven't seen the problem JinSung Kang reports. :)

--
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
[slurm-dev] Re: Running jobs are stopped and requeued when adding new nodes
I have added nodes to an existing partition several times using the same procedure which you describe, and no bad side effects have been noticed. This is a very normal kind of operation in a cluster, where hardware may be added or retired from time to time, while the cluster of course continues its normal production. We must be able to do this, especially when transferring existing nodes into a new Slurm cluster.

Douglas Jacobsen explained very well why problems may arise. It seems to me that this completely rigid nodelist bit mask in the network is a Slurm design problem, and that it ought to be fixed.

Question: How can we pinpoint the problem more precisely in a bug report to SchedMD (for support-customers only :-).

/Ole

On 10/22/2017 08:44 PM, JinSung Kang wrote:
> I am having trouble with adding new nodes into slurm cluster without
> killing the jobs that are currently running.
>
> Right now I
>
> 1. Update the slurm.conf and add a new node to it
> 2. Copy new slurm.conf to all the nodes,
> 3. Restart the slurmd on all nodes
> 4. Restart the slurmctld
>
> But when I restart slurmctld all the jobs that were currently running
> are requeued (Begin Time) as reason for not running. The new added node
> works perfectly fine.
>
> I've included the slurm.conf. I've also included slurmctld.log output
> when I'm trying to add the new node.
[slurm-dev] Re: Running jobs are stopped and requeued when adding new nodes
A workaround is to pre-configure future nodes and mark them as down; then when you add them you can just mark them as up (see the DownNodes parameter).

Hope this helps!

Merlin

--
Merlin Hartley
Computer Officer
MRC Mitochondrial Biology Unit
Cambridge, CB2 0XY
United Kingdom

> On 22 Oct 2017, at 19:55, Douglas Jacobsen wrote:
>
> You cannot change the nodelist without draining the system of running
> jobs (terminating all slurmstepd) and restarting all slurmd and
> slurmctld. This is because slurm uses a bit mask to represent the
> nodelist, and slurm uses a hierarchical overlay communication network.
> If all daemons don't have the same idea of that network you can run into
> communication problems which can cause nodes to be marked down, killing
> the jobs running upon them.
>
> I think if you are not using message aggregation, you might be able to
> get away with leaving jobs running and just restarting all slurmd and
> slurmctld. But the tricky thing is you'll need to quiesce a lot of the
> RPCs on the system, which can partially be done by marking partitions
> down, but not completely.
>
> If you are thinking of adding nodes, I think you should look at the
> FUTURE state that nodes can take. I haven't played with this, but I
> suspect it might buy you some flexibility.
>
> On Oct 22, 2017 11:43, "JinSung Kang" wrote:
>> Hello,
>>
>> I am having trouble with adding new nodes into slurm cluster without
>> killing the jobs that are currently running.
>>
>> Right now I
>>
>> 1. Update the slurm.conf and add a new node to it
>> 2. Copy new slurm.conf to all the nodes,
>> 3. Restart the slurmd on all nodes
>> 4. Restart the slurmctld
>>
>> But when I restart slurmctld all the jobs that were currently running
>> are requeued (Begin Time) as reason for not running. The new added node
>> works perfectly fine.
>>
>> I've included the slurm.conf. I've also included slurmctld.log output
>> when I'm trying to add the new node.
>>
>> Cheers,
>>
>> Jin
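Merlin's pre-configuration idea might look something like the slurm.conf fragment below. The node names and hardware values are hypothetical, and whether `State=FUTURE` and `DownNodes` behave this way on the 15.08 release in question should be verified against the slurm.conf man page for that version:

```
# Hypothetical spare node, defined up front so the nodelist (and the
# bitmask built from it) never has to change when the hardware arrives.
NodeName=compute005 CPUs=16 RealMemory=64000 State=FUTURE

# Alternative along the lines Merlin mentions: list a not-yet-present
# node as down with a reason.
DownNodes=compute006 State=DOWN Reason="not yet installed"
```

Once the hardware is actually in place, something like `scontrol update NodeName=compute005 State=RESUME` should bring the node into service without a nodelist change, though again this is an assumption to check against the documentation, not a tested recipe.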
[slurm-dev] Re: Running jobs are stopped and requeued when adding new nodes
You cannot change the nodelist without draining the system of running jobs (terminating all slurmstepd) and restarting all slurmd and slurmctld. This is because Slurm uses a bit mask to represent the nodelist, and Slurm uses a hierarchical overlay communication network. If all daemons don't have the same idea of that network, you can run into communication problems which can cause nodes to be marked down, killing the jobs running upon them.

I think if you are not using message aggregation, you might be able to get away with leaving jobs running and just restarting all slurmd and slurmctld. But the tricky thing is you'll need to quiesce a lot of the RPCs on the system, which can partially be done by marking partitions down, but not completely.

If you are thinking of adding nodes, I think you should look at the FUTURE state that nodes can take. I haven't played with this, but I suspect it might buy you some flexibility.

On Oct 22, 2017 11:43, "JinSung Kang" wrote:
> Hello,
>
> I am having trouble with adding new nodes into slurm cluster without
> killing the jobs that are currently running.
>
> Right now I
>
> 1. Update the slurm.conf and add a new node to it
> 2. Copy new slurm.conf to all the nodes,
> 3. Restart the slurmd on all nodes
> 4. Restart the slurmctld
>
> But when I restart slurmctld all the jobs that were currently running
> are requeued (Begin Time) as reason for not running. The new added node
> works perfectly fine.
>
> I've included the slurm.conf. I've also included slurmctld.log output
> when I'm trying to add the new node.
>
> Cheers,
>
> Jin
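Douglas's bitmask point can be illustrated with a tiny shell sketch. Nothing here is Slurm code and the node names are hypothetical; it only shows the underlying hazard: if peers are addressed by their position in a shared nodelist, inserting a node mid-list silently changes which node each position refers to.

```shell
#!/bin/sh
# Illustration (not Slurm code): positions in a nodelist shift when a
# node is inserted in the middle, so a bitmask built against the old
# list no longer refers to the same nodes.

index_of() {
    # index_of NAME item...  -> prints the 0-based position of NAME,
    # or -1 if it is not in the list.
    name=$1; shift
    i=0
    for n in "$@"; do
        if [ "$n" = "$name" ]; then
            echo "$i"
            return 0
        fi
        i=$((i + 1))
    done
    echo -1
}

OLD="compute001 compute002 compute003 compute005"
NEW="compute001 compute002 compute003 compute004 compute005"

# Under the old list, bit 3 means compute005...
echo "old index of compute005: $(index_of compute005 $OLD)"
# ...but after inserting compute004, bit 3 means compute004 and
# compute005 has moved to bit 4.
echo "new index of compute005: $(index_of compute005 $NEW)"
```

A daemon still holding the old list would interpret bit 3 as compute005 while the controller means compute004, which is the kind of disagreement that gets nodes marked down.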