[slurm-dev] options for relocating the primary slurm controller, slurmctld?
Hi, I have some new hardware that I'm hoping to move slurmctld to, and I'm wondering whether I need to haul along the IP address or whether there is some other sneaky method. Has anyone done this recently and been able to keep jobs from going to CG state?

I was testing this out with a dev cluster: I started some jobs, shut down slurmctld/slurmd, updated the config file, and started things back up. I then noticed squeue showing jobs in CG state, and on logging in to the corresponding compute nodes I found slurmstepd processes still in the process list. An strace on them showed they were trying to contact the old controller for the job-complete RPC. Maybe I messed up the order of operations, or maybe I really should haul along the IP.

Would slurmstepd fall back to the backup controller in the slurm.conf file it read when starting that job (if I had one defined)? E.g., could I define the new IP as the backup, roll that out, wait until all jobs that were running before the change have completed, and then switch the primary and backup controllers in the slurm.conf file?

Is this up to date? http://slurm.schedmd.com/faq.html#controller

*2. How should I relocate the primary or backup controller?*
If the cluster's computers used for the primary or backup controller will be out of service for an extended period of time, it may be desirable to relocate them. In order to do so, follow this procedure:
1. Stop all Slurm daemons
2. Modify the *ControlMachine*, *ControlAddr*, *BackupController*, and/or *BackupAddr* in the *slurm.conf* file
3. Distribute the updated *slurm.conf* file to all nodes
4. Copy the *StateSaveLocation* directory to the new host and make sure the permissions allow the *SlurmUser* to read and write it
5. Restart all Slurm daemons
There should be no loss of any running or pending jobs. Ensure that any nodes added to the cluster have a current *slurm.conf* file installed.
*CAUTION:* If two nodes are simultaneously configured as the primary controller (two nodes on which *ControlMachine* specifies the local host and on which the *slurmctld* daemon is executing), system behavior will be destructive. If a compute node has an incorrect *ControlMachine* or *BackupController* parameter, that node may be rendered unusable, but no other harm will result.
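For concreteness, the FAQ procedure amounts to something like the following. This is only a sketch: the hostname newctl, the IP address, the state directory path, and the use of systemctl/pdsh are all assumptions, so adjust for your site.

    # 1. stop daemons everywhere
    systemctl stop slurmctld          # on the old controller
    pdsh -a systemctl stop slurmd     # on the compute nodes

    # 2. point slurm.conf at the new host
    ControlMachine=newctl
    ControlAddr=192.168.1.50          # only needed if it differs from what the hostname resolves to

    # 3. push the updated slurm.conf to every node

    # 4. move the saved state and fix ownership
    rsync -a /var/spool/slurmctld/ newctl:/var/spool/slurmctld/
    ssh newctl chown -R slurm: /var/spool/slurmctld

    # 5. start slurmctld on newctl, then slurmd on the compute nodes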
[slurm-dev] login node configuration?
Hello,

I’m in the process of setting up SLURM 15.08.X for the first time. I’ve got a head node and ten compute nodes working fine for serial and parallel jobs. I’m struggling with figuring out how to configure a pair of login nodes to be able to get status and submit jobs. I’ve been searching through the documentation and googling, but I’ve not found a reference other than brief mentions of login nodes or submit hosts.

Our pair of login nodes share an ethernet network with the head node. They do not, currently, share a network, ethernet or IB, with the compute nodes. When I execute SLURM commands like sinfo or squeue, using strace I can see them attempting to communicate with port 6817 on the head node. However, the head node doesn’t respond. lsof on the head node seems to indicate slurmctld is only listening on the private network shared with the compute nodes.

Is there a reference for configuring login nodes to be able to query slurmctld and submit jobs? If not, would somebody with a similar configuration be willing to share some guidance? Do the login nodes need to be on the same network as the compute nodes, or at least the same network as slurmctld is listening on?

Apologies if these questions have been answered somewhere else, I just haven’t found that documentation.

Regards,
-liam

-There are uncountably more irrational fears than rational ones. -P. Dolan
Liam Forbes lofor...@alaska.edu ph: 907-450-8618 fax: 907-450-8601
UAF Research Computing Systems Senior HPC Engineer, LPIC1, CISSP
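The checks described above amount to something like the following (illustrative commands, not a transcript of the original debugging session; the hostname is made up):

    # on a login node: watch which address the client tries to reach
    strace -f -e trace=network sinfo 2>&1 | grep connect

    # on the head node: which addresses slurmctld is actually bound to
    lsof -nP -iTCP:6817 -sTCP:LISTEN
    ss -lntp | grep 6817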
[slurm-dev] Re: login node configuration?
For clarity, they should not need to talk to the compute nodes unless you intend to do interactive work. You should only need to talk to the master to submit jobs.

-Paul Edmon-

On 10/26/2015 9:45 PM, Paul Edmon wrote:
> What we did was that we just opened up port 6817 between the two VLANs. So long as the traffic is routable and they see the same slurm.conf, that should work. All the login node needs is slurm.conf and the slurm, slurm-munge, and slurm-plugin rpms. You don't need to run the slurm service to submit; all you need is the ability of the login node to talk to the master.
>
> On 10/26/2015 8:02 PM, Liam Forbes wrote:
> [...]
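For example, from a login node you can confirm the controller is reachable with standard tools (the hostname below is made up):

    scontrol ping            # reports whether the primary/backup controller respond
    nc -zv headnode 6817     # raw TCP check against the default SlurmctldPort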
[slurm-dev] Re: login node configuration?
What we did was that we just opened up port 6817 between the two VLANs. So long as the traffic is routable and they see the same slurm.conf, that should work.

All the login node needs is slurm.conf and the slurm, slurm-munge, and slurm-plugin rpms. You don't need to run the slurm service to submit; all you need is the ability of the login node to talk to the master.

-Paul Edmon-

On 10/26/2015 8:02 PM, Liam Forbes wrote:
> [...]
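A minimal sketch of that setup, assuming the controller is called headnode, the default SlurmctldPort of 6817, and RPM-based package names (all of which may differ on your systems):

    # on the login node: client packages plus the shared config and munge key
    yum install slurm slurm-munge slurm-plugins
    scp headnode:/etc/slurm/slurm.conf /etc/slurm/slurm.conf
    scp headnode:/etc/munge/munge.key /etc/munge/munge.key
    systemctl start munge

    # wherever the two VLANs meet: allow TCP 6817 from the login subnet
    iptables -A INPUT -p tcp -s <login-subnet> --dport 6817 -j ACCEPT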
[slurm-dev] Re: Slurm, job preemption, and swap.
It would be easy if there were a way to force TRES allocation/reconfiguration, e.g. add swap as GRES/swap, and on suspend transfer the allocation from TRES=mem=64,GRES/swap=0 to TRES=mem=0,GRES/swap=64. Then you could start the new job, which requires available mem.

Would it be possible to add such a mechanism to scontrol update?

On 10/23/2015 03:14 AM, Bill Broadley wrote:

I've been using the example documented at: http://slurm.schedmd.com/preempt.html

Specifically, this excerpt from slurm.conf:
PartitionName=low Nodes=linux Default=YES Shared=NO Priority=10 PreemptMode=requeue
PartitionName=med Nodes=linux Default=NO Shared=FORCE:1 Priority=20 PreemptMode=suspend
PartitionName=high Nodes=linux Default=NO Shared=FORCE:1 Priority=30 PreemptMode=off

All my compute nodes have at least as much swap as RAM. This works quite well: high priority jobs can suspend medium priority jobs, and if there's memory pressure on the node, suspended jobs can be pushed to swap. I enforce the memory limits, so jobs using more RAM than they ask for get killed. With the slurm 2.6.5 to 14.11 upgrade, slurm added the ability to manage memory limits as well as CPU.

So I started adding GrpMemory to users, so if they purchase 4 nodes they can allocate a total of 4 nodes of CPUs or 4 nodes of memory in the high priority queue. So I have entries like:

User-'test':Partition='high':DefaultAccount='testgrp':GrpCPUs=128:GrpMemory=256000

And I set DefMemPerCPU=2000, so that users who do not ask for a specific memory allocation get 2GB per CPU. My nodes have 64GB RAM and 32 CPUs. This works quite well, but it broke preemption.

Now if I'm running 32 2GB jobs in the medium queue, no high priority jobs can run because all RAM is allocated. That seems quite weird to me: if a job is SIGSTOP'd to suspend it, any memory pressure should force the suspended memory pages into swap. Given that the suspended job isn't running, that shouldn't cause too much I/O, since each page is written just once, no churning.

Is there any way to get slurm to not count suspended jobs' memory allocation towards the node's memory used total?

Any suggestions on how to get the old behavior back, where high priority jobs can be suspended?
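Just to make the idea concrete, the mechanism being wished for would look something like this. Note this is purely hypothetical: the GRES named swap is only a countable resource here, and the scontrol update syntax on the last line does not exist today.

    # hypothetical slurm.conf on each node
    GresTypes=swap
    NodeName=node[01-32] RealMemory=64000 Gres=swap:64000

    # hypothetical operator action when suspending job 1234
    scontrol suspend 1234
    scontrol update JobId=1234 TRES=mem=0,gres/swap=64000   # not a real option today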
[slurm-dev] Re: salloc timeout question
Kyle Stern writes:

> When I set a time limit for my salloc, why do I still have to exit from it once the allocation is revoked? For example:
>
> salloc -t1
> salloc: Granted job allocation 112645
> salloc: Job 112645 has exceeded its time limit and its allocation has been revoked.
> ^C
> echo $SLURM_JOBID
> 112645
> exit
> echo $SLURM_JOBID
>
> I'm using v14.11.8.

Hi Kyle,

Maybe the following can help answer your question. Just my guess, I haven't tested this.

From the salloc man page, --time (-t) option: "When the time limit is reached, each task in each job step is sent SIGTERM followed by SIGKILL." Notice that it says "each task in each job step". The shell that gets created as a result of your salloc command is not a task in a job step. If you used the method described in the FAQ to launch your shell, then I would guess the salloc shell would also be terminated. See: http://slurm.schedmd.com/faq.html#salloc_default_command

The man page for scancel might also help explain the difference between terminating a job step and the batch command, especially the "--batch" and "--signal" options and the whole "ARGUMENTS" section.

Deric
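The FAQ entry referenced above boils down to launching the shell as a proper job step, roughly like this (a sketch of the commonly suggested setting, not something verified here on 14.11):

    # slurm.conf
    SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --mpi=none $SHELL"

With that in place the interactive shell is itself a task in a job step, so the SIGTERM/SIGKILL sent at the time limit also ends the shell instead of leaving it around after the allocation is revoked.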
[slurm-dev] Re: Slurm, job preemption, and swap.
We ran into this swap issue when using SelectTypeParameters=CR_CPU_Memory. We are still on 14.03.10 and have a very ugly hack that adds a SchedulerParameter of "assume_swap" which basically forces SLURM to ignore memory allocations of swapped jobs. The patch was very rushed, so we likely ended up just making SLURM behave as if it were configured with CR_CPU instead of CR_CPU_Memory.

When we upgrade to 15.08.x we will be using CR_CPU without our patch, since we define MaxMemoryPerCPU on all partitions. So far in testing, CR_CPU with MaxMemoryPerCPU results in behavior where a 64GB node can have 64GB worth of suspended jobs and still run 64GB worth of active jobs. If a user requests 1 CPU and 64GB with MaxMemoryPerCPU=2000, they end up with 32 CPUs, which we use for QOS resource limits and accounting.

Attached are the patches. They likely only work on 14.03.x releases. I wouldn't recommend using the patches, but they may give an idea of how to implement a proper solution that is worthy of being submitted for inclusion in SLURM.

- Trey

=
Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: treyd...@tamu.edu
Jabber: treyd...@tamu.edu

On Mon, Oct 26, 2015 at 9:06 AM, Daniel Letai wrote:
> [...]

0001-select_cons_res.c.patch
Description: Binary data
0002-add-assume_swap-config-option.patch
Description: Binary data
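For anyone who wants to try the unpatched behavior Trey describes on 15.08, the relevant configuration is roughly the following. This is a sketch; note the actual slurm.conf keyword is MaxMemPerCPU rather than MaxMemoryPerCPU, and the values are simply the ones from this thread.

    # slurm.conf
    SelectType=select/cons_res
    SelectTypeParameters=CR_CPU   # memory is not treated as a consumable resource
    MaxMemPerCPU=2000             # a 1-CPU, 64GB request is scaled up to 32 CPUs

Because large memory requests are converted into proportionally more CPUs, the scheduler only tracks CPUs, which is why a 64GB node can hold 64GB of suspended jobs in swap and still run another 64GB of active jobs.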
[slurm-dev] job_submit/lua plugin: Fix DefaultQOS lookup
Attached is a patch to resolve an issue where, when a user does not exist in the accounting database (which should prevent job submission), a job_submit.lua script that uses the default QOS lookup will cause a segfault. I was unsure whether checking 'accounting_enforce' was the correct method, or whether just checking if assoc_ptr was NULL would be enough.

Thanks,
- Trey

=
Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: treyd...@tamu.edu
Jabber: treyd...@tamu.edu

job_submit-default_qos-fix-15.08.patch
Description: Binary data
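For context, the crash is triggered from the Lua side by scripts that read the default QOS, e.g. something along these lines (a simplified sketch, not Trey's actual script; the field and function names follow the job_submit/lua interface as I understand it):

    -- job_submit.lua (sketch)
    function slurm_job_submit(job_desc, part_list, submit_uid)
        -- reading job_desc.default_qos makes the C plugin look up the
        -- submitter's association; with no association in the accounting
        -- database that lookup is what the patch guards against
        if job_desc.qos == nil then
            local dq = job_desc.default_qos
            if dq ~= nil then
                job_desc.qos = dq
            end
        end
        return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
        return slurm.SUCCESS
    end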