[slurm-dev] Re: Send notification email
If I understand your question, you can set it in the slurm.conf file; the default is:

MailProg=/usr/bin/mail

From: Fanny Pagés Díaz
Reply-To: slurm-dev
Date: Wednesday, September 28, 2016 at 11:45 AM
To: slurm-dev
Subject: [slurm-dev] Send notification email

I need to send notification email from Slurm using another mail server, not the standard one. Can anyone help me?
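Since MailProg can point at any executable that accepts mail(1)-style arguments, one approach is a small wrapper that relays through the alternate server. This is only a sketch: the wrapper path and SMTP host are placeholders, and it assumes a mailx variant (s-nail/Heirloom) that accepts -S smtp=.

```shell
#!/bin/sh
# /etc/slurm/mailwrapper.sh (hypothetical path)
# slurmctld invokes MailProg roughly as:  MailProg -s "<subject>" <user>
# Relay through a non-default SMTP server instead of local delivery.
exec /usr/bin/mailx -S smtp=smtp://mail.example.org "$@"
```

Then point slurm.conf at it (MailProg=/etc/slurm/mailwrapper.sh), make the wrapper executable, and reconfigure slurmctld.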
[slurm-dev] Re: slum in the nodes not working
Make sure the slurm.conf file is identical on all nodes. If the slurmctld is running, and all the slurmd's are running, take a look at the slurmctld.log; it should provide some clues. If not, you might want to post the contents of your slurm.conf file.

Phil Eckert
LLNL

From: Fany Pagés Díaz
Reply-To: slurm-dev
Date: Monday, December 21, 2015 at 12:39 PM
To: slurm-dev
Subject: [slurm-dev] Re: slum in the nodes not working

When I start the server, the nodes are down. I start /etc/init.d/slurm on the server and it's fine, but the nodes stay down. I restarted the nodes again and nothing. Any idea?

From: Carlos Fenoy [mailto:mini...@gmail.com]
Sent: Monday, December 21, 2015 12:59
To: slurm-dev
Subject: [slurm-dev] Re: slum in the nodes not working

You should not start the slurmctld on all the nodes, only on the head node of the cluster; on the compute nodes start the slurmd with "service slurm start".

On Mon, Dec 21, 2015 at 6:27 PM, Fany Pagés Díaz wrote:
I had to turn off my cluster because of electricity problems, and now slurm is not working. The nodes are down and the slurm daemons on the nodes fail. When I run the slurmctld -D command on the nodes, I get the following error:

slurmctld: error: this host (compute-0-0) not valid controller (cluster or (null))

How can I fix that? Can anyone help me, please?

Ing. Fany Pagés Díaz

--
Carles Fenoy
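A quick way to act on the "identical slurm.conf everywhere" advice is to compare checksums from the head node. The node names in the commented sweep are hypothetical; the comparison helper itself is plain POSIX shell:

```shell
#!/bin/sh
# Compare a fetched copy of slurm.conf against the head node's copy.
conf_check () {
    # $1 = reference slurm.conf, $2 = copy fetched from a node
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "DIFFERS"
    fi
}

# Example sweep (hypothetical node names):
# for n in compute-0-0 compute-0-1; do
#     scp -q "$n:/etc/slurm/slurm.conf" "/tmp/slurm.conf.$n"
#     printf '%s: %s\n' "$n" "$(conf_check /etc/slurm/slurm.conf /tmp/slurm.conf.$n)"
# done
```

Any node reporting DIFFERS is a candidate for why its slurmd stays down.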
[slurm-dev] Re: How can I send a mail when I finished a job?
I believe that all that is happening in regard to mail is that the slurmctld is executing the mail utility with the standard arguments. Is mail set up on the node the slurmctld is running on? A quick test would be to log in there and manually send yourself email.

Phil Eckert
LLNL

On 12/18/15, 9:42 AM, "Fany Pagés Díaz" wrote:
>
>I send my job like this:
>
>salloc -n 2 -N 2 --gres=gpu:2 --mail-type=ALL --mail-user=fpa...@citi.cu
>mpirun job1
>
>The job finished fine, but it never sends the email. Don't I have to do
>anything for slurm to know how to send the email?
>
>From: Wiegand, Paul [mailto:wieg...@ist.ucf.edu]
>Sent: Friday, December 18, 2015 12:00
>To: slurm-dev
>Subject: [slurm-dev] Re: How can I send a mail when I finished a job?
>
>You have to tell it which events you want to receive email about, too.
>Like this in your submit script:
>
>#SBATCH --mail-type=FAIL
>#SBATCH --mail-type=BEGIN
>#SBATCH --mail-type=END
>#SBATCH --mail-user myem...@address.net
>
>> On Dec 18, 2015, at 11:26, Fany Pagés Díaz wrote:
>>
>> I need to know the status of the job, but I used the --mail-user=myemail
>> parameter and it is not working. Do I have to do some configuration on
>> the server?
>>
>> Can anyone help me?
>> Ing. Fany Pagés Díaz
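Phil's quick test plus the submit-script directives from the thread, collected in one place. First, on the slurmctld host, check that local mail works at all with something like: echo "test body" | mail -s "slurm mail test" you@example.org (the address is a placeholder). Then a minimal batch script requesting notifications:

```shell
#!/bin/sh
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=you@example.org    # placeholder address
srun ./job1
```

The same --mail-type/--mail-user options apply to salloc, as in the original message; if the manual mail test fails on the slurmctld node, no Slurm option will help until that is fixed.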
[slurm-dev] Re: A floating exclusive partition
A possibility might be to do this using reservations. You could create a 5-node reservation with all concerned users having access, then have a script run by cron that periodically checks the state of the nodes in the reservation; if any go down, update the reservation, replacing the down nodes with up nodes. If there are no up nodes, determine the soonest a node will be free and add it to the reservation using the IGNORE_JOBS flag.

Phil Eckert
LLNL

On 11/19/15, 8:09 AM, "Paul Edmon" wrote:
>
>Yeah, I guess QoS won't really work for overflow. I was more thinking
>of the QoS as a way to create a floating partition of 5 nodes with the
>rest being in the public queue. They would send jobs to the QoS to hit
>that, and then when it is full they would submit to public as normal.
>That's at least my thinking, but it's less seamless to the users as they
>will have to consciously monitor what is going on.
>
>-Paul Edmon-
>
>On 11/19/2015 10:50 AM, Daniel Letai wrote:
>>
>> Can you elaborate a little? I'm not sure what kind of QoS will help,
>> nor how to implement one that will satisfy the requirements.
>>
>> On 11/19/2015 04:52 PM, Paul Edmon wrote:
>>>
>>> You might consider a QoS for this. It may not do everything you want,
>>> but it will give you the flexibility.
>>>
>>> -Paul Edmon-
>>>
>>> On 11/19/2015 04:49 AM, Daniel Letai wrote:

Hi,

Suppose I have a 100 node cluster with ~5% of nodes down at any given time (maintenance/hw failure/...). One of the projects requires exclusive use of 5 nodes, and must be able to use the entire cluster when available (when other projects aren't running). I can do this easily if I maintain a static list of the exclusive nodes in slurm.conf:

PartitionName=public Nodes=tux0[01-95] Default=YES
PartitionName=special Nodes=tux[001-100] Default=NO

and allow only that project to use partition special. However, due to the 5% downtime, I'd like to maintain a dynamic exclusive 5 nodes. Any suggestions?
The project is serial and deployed as an array of single-node jobs, so I can run it even when the other 95 nodes are full.

Thanks,
--Dani_L.
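The cron-driven repair Phil outlines might look roughly like this. Everything site-specific (the reservation name "special5", the sinfo field layout) is an assumption; the awk helper just joins the names of down/drained nodes:

```shell
#!/bin/sh
# Sketch: keep a 5-node reservation ("special5", hypothetical name)
# off of down hardware. Run from cron, e.g. every 5 minutes.

down_nodes () {
    # Input: "name state" pairs, one per line, as produced by:
    #   sinfo -h -N -t down,drain -o "%N %t"
    # Output: comma-separated list of down/drained node names.
    awk '$2 ~ /^(down|drain)/ { s = s (s ? "," : "") $1 } END { print s }'
}

# bad=$(sinfo -h -N -t down,drain -o "%N %t" | down_nodes)
# if [ -n "$bad" ]; then
#     # pick replacement idle nodes and rebuild the reservation;
#     # fall back to the IGNORE_JOBS flag if nothing is free, as above
#     scontrol update ReservationName=special5 Nodes=<5 healthy nodes>
# fi
```

The replacement-selection step is deliberately left as a comment: choosing which 5 healthy nodes to swap in is exactly the policy decision the thread is about.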
[slurm-dev] Re: User Control of WallTime for running job
The reason this has a higher permission level is that a user could game the system by submitting a job with a 1-minute time limit, which will generally get it started very quickly because of backfill, and then increase it to whatever they wanted. I believe almost all batch systems disallow this.

Phil Eckert
LLNL

From: Jay Sullivan
Reply-To: slurm-dev
Date: Tuesday, November 17, 2015 at 10:51 AM
To: slurm-dev
Subject: [slurm-dev] User Control of WallTime for running job

Hello,

I apologize if I missed the answer on how to do this, but I am hoping there is a way.

Scenario: A job is in the RUN state, and the job is taking longer than expected. The user needs to increase the wall time of the job to allow it to complete. The user cannot increase the wall time, because they do not have "operator" or "admin" privileges. For many reasons, I do not want to give even "operator" control to all users just to give them the ability to adjust their wall time. So a few questions:

1) Is there a way to do this with the stock configuration?
2) If 1 is not possible, is there a way to add a custom AdminLevel? One where I can set just the commands that users have access to?
3) If neither of these is possible, can we file an RFE?

Thanks,
-Jay

Jay Sullivan
HPC Systems Administrator
Office: 310-970-3866
Mobile: 424-255-2713
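For completeness, the operator-level command in question (the job id is hypothetical). As far as I know, stock Slurm does let an unprivileged user decrease their own job's limit; it is only increases that require elevated AdminLevel, for exactly the backfill-gaming reason Phil gives:

```shell
# Raising a running job's wall time (requires AdminLevel operator/admin):
scontrol update JobId=12345 TimeLimit=08:00:00

# scontrol also documents relative adjustments, e.g. adding minutes:
scontrol update JobId=12345 TimeLimit=+120
```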
[slurm-dev] Re: Requested node configuration is not available when using -c
Mike,

In your slurm.conf you have Procs=1 (which is the same as CPUs=1); Sockets, if omitted, will be inferred from CPUs (default is 1), and CoresPerSocket defaults to 1. So at this point the slurm.conf has a default configuration of 1 core per node.

Phil Eckert
LLNL

From: Michal Zielinski michal.zielin...@uconn.edu
Reply-To: slurm-dev slurm-dev@schedmd.com
Date: Tuesday, September 9, 2014 at 6:35 AM
To: slurm-dev slurm-dev@schedmd.com
Subject: [slurm-dev] Re: Requested node configuration is not available when using -c

Josh,

I believe that -n sets the number of tasks. I only want a single task, as when a single process uses multiple cores. "srun -n 2 hostname" returns:

linux-slurm2
linux-slurm3

which is definitely not what I want.

Thanks,
Mike

On Mon, Sep 8, 2014 at 8:07 PM, Josh McSavaney mcsa...@csh.rit.edu wrote:
I believe your slurm.conf is defining 4 nodes with a single logical processor each. You are then trying to allocate two CPUs on a single node with srun, which (according to your slurm.conf) you do not have. You may want to consider `srun -n 2 hostname` and see where that lands you.

Regards,
Josh McSavaney
Bit Flipper
Rochester Institute of Technology

On Mon, Sep 8, 2014 at 7:42 PM, Christopher Samuel sam...@unimelb.edu.au wrote:
On 09/09/14 07:26, Michal Zielinski wrote:
I have a small test cluster (node[1-4]) running slurm 14.03.0 set up with CR_CPU and no usage restrictions. Each node has just 1 CPU. [...] But *srun -c 2 hostname* does not work, and it returns the above error. I have no idea why I can't dedicate 2 cores to a single job if I can dedicate each core individually to a job.

What does "scontrol show node" say?
cheers,
Chris
--
Christopher Samuel    Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au    Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/    http://twitter.com/vlsci
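If the hardware really did have more than one core, the fix would be to describe it in slurm.conf; with the defaults Phil lists, every node advertises a single core, so -c 2 can never be satisfied. A hypothetical fragment for 2-core nodes:

```shell
# slurm.conf fragment (hypothetical topology)
NodeName=node[1-4] Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
# Equivalent shorthand: CPUs=2. With CPUs=1 (a.k.a. Procs=1), srun -c 2
# must fail with "Requested node configuration is not available".
```

In Mike's case the nodes genuinely have 1 CPU, so the error is correct: there is no single node that can provide two cores to one task.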
[slurm-dev] Re: Fwd: Can I stop slurm from copying a script to execution node
If you don't wish to do the submission from the "somepath" directory, you can use the following sbatch option to achieve what you are looking for:

-D, --workdir=directory
    Set the working directory of the batch script to directory before it is executed.

Phil Eckert
LLNL

From: Thomas Johnson tho...@outdoorsnewzealand.co.nz
Reply-To: slurm-dev slurm-dev@schedmd.com
Date: Wednesday, July 9, 2014 at 7:31 PM
To: slurm-dev slurm-dev@schedmd.com
Subject: [slurm-dev] Fwd: Can I stop slurm from copying a script to execution node

I am submitting a job with "sbatch /somepath/test.sh". test.sh looks for config files and other scripts in the same path, e.g. /somepath/. /somepath/ is available to all submit and compute nodes, but slurm copies the script to /var/lib/slurm-llnl/slurmd/etc/ before executing it. Thus test.sh can't find the required config files and scripts. I'm changing over from SGE, where adding the "-b y" flag to qsub would stop SGE from copying the script to the execution host. Is there a similar solution for slurm?
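Applied to the submission from the thread (paths as in the original message), that becomes:

```shell
# Run the (copied) batch script with /somepath as its working directory,
# so relative lookups for configs and helper scripts resolve there:
sbatch -D /somepath /somepath/test.sh

# or, equivalently, inside test.sh itself:
#SBATCH --workdir=/somepath
```

Slurm will still copy the script to its spool directory, but with the working directory pinned, test.sh can find its neighbors via relative paths.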
[slurm-dev] Re: pbsdsh -u equivalent
Hartley,

Sounds like you might be wanting srun. If I ask for 5 nodes on our rzmerl system:

salloc -p pdebug -N 5
salloc: Granted job allocation 1966117
srun hostname
rzmerl1
rzmerl2
rzmerl4
rzmerl3
rzmerl5

Phil Eckert
LLNL

From: Hartley Greenwald jhgreenw...@gmail.com
Reply-To: slurm-dev slurm-dev@schedmd.com
Date: Monday, June 30, 2014 at 2:23 PM
To: slurm-dev slurm-dev@schedmd.com
Subject: [slurm-dev] pbsdsh -u equivalent

Hi,

Is there an equivalent command in slurm for the PBS command "pbsdsh -u"? That is to say, is there some command which will give one copy of a command to each node in a given allocation? I've combed through the documentation and there doesn't seem to be, but it struck me as odd that there wouldn't be, so that's why I'm asking.

Thank you,
Hartley
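If the goal is exactly one copy of the command per node regardless of how many tasks the allocation could hold, srun can be pinned down explicitly with a flag documented in srun(1):

```shell
# pbsdsh -u analogue: one instance of the command on each allocated node
salloc -N 5
srun --ntasks-per-node=1 hostname
```

Phil's plain "srun hostname" happens to do the same thing here because the allocation defaults to one task per node; --ntasks-per-node=1 makes that explicit when the allocation requests more tasks than nodes.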
[slurm-dev] Re: moab/slurm question
Marti,

If the job is submitted using msub, the release of the dependency would need to be:

mjobctl -m depend=none jobid

If you use:

mjobctl -m depend= jobid

it only removes the dependency in Moab, not Slurm. This works fine if you are using just-in-time scheduling, since the jobs only migrate to Slurm when they have resources to run and dependencies have been met. But using the first method should work in both cases.

Phil Eckert
LLNL

From: Hill, Marti T mh...@lanl.gov
Reply-To: slurm-dev slurm-dev@schedmd.com
Date: Wednesday, April 23, 2014 at 8:45 AM
To: slurm-dev slurm-dev@schedmd.com
Subject: [slurm-dev] moab/slurm question

It seems that I can remove a dependency from a job using "mjobctl -m", but it does not remove the dependency as far as slurm goes. squeue still shows the job held… What can we do?

Marti
[slurm-dev] Re: backfill scheduler look ahead?
Bill,

In addition to what Alejandro said, there is another consideration. You indicated the top two high-priority jobs and the 30-core job; I'm assuming that the "..." indicated a number of other queued jobs ahead of the 30-core job. Also, you didn't state it, but I'm also assuming there were other jobs running at the time. If both of these assumptions are true, then you would need to consider the completion time of all the running jobs in relation to the needs of the jobs ahead of the 30-core job in the queue. The 60 cores may be needed by a higher-priority job that is waiting for a currently running job, or jobs, that will complete in less than two hours and provide the number of cores it needs.

We have been using backfill batch systems, including SLURM, here at LLNL for over 20 years, and trying to answer this question for our users is never easy. A conclusive way of determining when a job will either start or be backfilled is to do an squeue and an sinfo, then map X-Y coordinates with time and nodes to represent the blocks that jobs will use. This is a bit painful, but it will provide a lot of insight into backfill.

I hope this is helpful.

Phil Eckert
LLNL

On 2/21/14 2:57 AM, Alejandro Lucero Palau alejandro.luc...@bsc.es wrote:

Hi Bill,

I think Moe gave you the right answer, but it was so concise it can be easily misunderstood. If we take the situation you describe with a simple analysis from the backfilling algorithm's point of view, the answer is that job 300 should be scheduled without any impact on jobs 201 and 202. However, what I think Moe tried to say is that there are other details to take into account, not just the total number of free cores. Those cores could be really free but, for example, due to per-node memory requirements they cannot be used. Or maybe you have reservations which are reserving some cores, but you cannot see it just by looking at free cores. Or you have some license or partition limitations.
Or your system does not allow node sharing, so free cores do not mean you can use them. All this assumes you do not have other pending jobs between job 201 and job 300. There is a backfilling parameter, max_job_bf, which limits the number of jobs to be processed by the algorithm; the default is 50. Also, as backfilling is so demanding, it is suspended after some time. Before resuming, if something changed in the system, the backfilling algorithm will start from scratch. You can avoid this using the bf_continue parameter. As you can see, there are a lot of details which could have an impact. We have suffered this situation in the past, and it is not always trivial to see the reason behind scheduling decisions. I added extra debug information for the backfilling algorithm to see how resources were being reserved by pending jobs, and it was helpful. Maybe it would be interesting to have some way of knowing why a job cannot be scheduled. There are other resource managers giving this detailed information, but it would have a cost, of course.

On 02/21/2014 12:45 AM, Bill Wichser wrote:

Moe,

That's quite an obfuscated answer! I was looking for a "yes, this is the expected behavior" or "no, something is amuck." In the case presented, again I'll say, it is clearly evident that the waiting job, number 300, can run. It has free cores, and the job it is waiting behind will have plenty of cores available when the job it is waiting on finishes, yet it does not start, simply because the time it requires would interfere with the current start time of the currently waiting job, #201. But the assertion that job 201 would be held up by starting job 300 is completely incorrect in this case. Now if this is the way the scheduler works, by being simple-minded about time constraints, then it is what it is. I'm asking only if this behavior is the expected behavior. I think you are trying to say that indeed this is the case.
Sincerely,
Bill

On 2/20/2014 1:21 PM, Moe Jette wrote:

Slurm uses what is known as a conservative backfill scheduling algorithm. No job will be started that adversely impacts the expected start time of _any_ higher-priority job. The scheduling can also be affected by a job's requirements for memory, generic resources, licenses, and resource limits.

Moe Jette
SchedMD LLC

Quoting Bill Wichser b...@princeton.edu:

Just a question on the expected behavior of the backfill scheduler. This is an SMP machine, if that matters. The scheduler is backfill with no preemption. I have a number of jobs queued. There are three which matter, ordered by priority. In the current state I have 60 free cores.

job 201 needs 200 cores and will start in 1 hour, requiring 24 hours of runtime
job 202 needs 250 cores and will start in 5 hours, requiring 24 hours of runtime
...
job 300 needs 30 cores and will start in 300 hours, requiring 2 hours of runtime

The job completing in 1 hour will free 252 cores. Clearly, starting job 300 will not impact job 201's start time in any way. Yet
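Slurm can report expected start times itself, which makes the time-by-nodes map Phil describes less painful to assemble. A sketch: the squeue format specifiers (%i job id, %S expected start, %C CPUs) are standard, but the output layout is my own:

```shell
#!/bin/sh
# Tabulate pending jobs by expected start time (earliest first), to see
# which reservations-in-time a backfill candidate must fit around.

start_map () {
    # Input lines: "JOBID START_TIME CPUS", as produced by e.g.
    #   squeue -h -t PD --start -o "%i %S %C"
    sort -k2,2 |
    awk '{ printf "job %-6s start %-20s cpus %s\n", $1, $2, $3 }'
}

# squeue -h -t PD --start -o "%i %S %C" | start_map
```

Reading this alongside sinfo's free-core count is a crude version of the X-Y plot: each line is a block of cores reserved from its start time forward, and a backfill candidate must fit entirely in front of the earliest block it would collide with.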
[slurm-dev] Re: Can't use sbatch with cron
A lot of suggestions of what to check for here:

https://groups.google.com/forum/#!topic/slurm-devel/qduhQ5EbjaQ

Phil Eckert
LLNL

On 11/21/13 5:00 PM, Arun Durvasula arun.durvas...@gmail.com wrote:

"Zero Bytes were transmitted or received"
[slurm-dev] Re: Admin reservation on busy nodes
I see the "nodes busy" message only if I am trying to create a reservation on top of another reservation that includes the same nodes. You might try adding the OVERLAP flag if this is the case.

Phil Eckert
LLNL

From: Jacqueline Scoggins jscogg...@lbl.gov
Reply-To: slurm-dev slurm-dev@schedmd.com
Date: Tuesday, November 12, 2013 9:27 AM
To: slurm-dev slurm-dev@schedmd.com
Subject: [slurm-dev] Re: Admin reservation on busy nodes

I tried that and it stated that the nodes were busy.

Jackie

On Tue, Nov 12, 2013 at 9:16 AM, Paul Edmon ped...@cfa.harvard.edu wrote:
Include the ignore_jobs flag. That will force the reservation.

-Paul Edmon-

On 11/12/2013 12:11 PM, Jacqueline Scoggins wrote:
Running slurm 2.5.7; I tried to reserve the nodes of the cluster because of hardware issues that needed to be repaired. Some of the nodes were allocated with jobs and others were not. I tried to do the following but got an error that the nodes were busy and the reservation was not set:

scontrol create reservation flags=ignore_jobs,maint starttime=now endtime=yyyy-mm-ddThh:mm partition=blah

It would not work. Is there a way of setting system reservations on a partition even if there are running jobs allocated to nodes?

Thanks
Jackie
[slurm-dev] Re: Admin reservation on busy nodes
Jackie,

I was trying this with an earlier version of SLURM. I just built a 2.5.7 test system and tried it again, and I am seeing the same failures that you do when any of the nodes in the partition are allocated. A workaround is to use the nodes= option, i.e.:

scontrol create reservation flags=ignore_jobs nodes=tnodes[32-591] starttime=now endtime=tomorrow partition=pbatch user=eckert

Phil Eckert
LLNL

From: Jacqueline Scoggins jscogg...@lbl.gov
Reply-To: slurm-dev slurm-dev@schedmd.com
Date: Tuesday, November 12, 2013 10:58 AM
To: slurm-dev slurm-dev@schedmd.com
Subject: [slurm-dev] Re: Admin reservation on busy nodes

I also believe I tried that one, as well as the other two, and each time I got the "nodes busy" message. If the nodes are in the alloc state, will either of these flags work? From what I saw, they would not work in this case.

Jackie

On Tue, Nov 12, 2013 at 9:34 AM, Eckert, Phil ecke...@llnl.gov wrote:
I see the "nodes busy" message only if I am trying to create a reservation on top of another reservation that includes the same nodes. You might try adding the OVERLAP flag if this is the case.

Phil Eckert
LLNL

From: Jacqueline Scoggins jscogg...@lbl.gov
Reply-To: slurm-dev slurm-dev@schedmd.com
Date: Tuesday, November 12, 2013 9:27 AM
To: slurm-dev slurm-dev@schedmd.com
Subject: [slurm-dev] Re: Admin reservation on busy nodes

I tried that and it stated that the nodes were busy.

Jackie

On Tue, Nov 12, 2013 at 9:16 AM, Paul Edmon ped...@cfa.harvard.edu wrote:
Include the ignore_jobs flag. That will force the reservation.

-Paul Edmon-

On 11/12/2013 12:11 PM, Jacqueline Scoggins wrote:
Running slurm 2.5.7; I tried to reserve the nodes of the cluster because of hardware issues that needed to be repaired. Some of the nodes were allocated with jobs and others were not.
I tried to do the following but got an error that the nodes were busy and the reservation was not set:

scontrol create reservation flags=ignore_jobs,maint starttime=now endtime=yyyy-mm-ddThh:mm partition=blah

It would not work. Is there a way of setting system reservations on a partition even if there are running jobs allocated to nodes?

Thanks
Jackie
[slurm-dev] Re: Admin reservation on busy nodes
Jackie,

It looks like in 2.5.7, according to the scontrol man page, the correct syntax would be:

scontrol create reservation flags=PART_NODES,IGNORE_JOBS nodes=ALL starttime=now endtime=tomorrow partitionname=pbatch user=eckert

but unfortunately, that doesn't work either.

Phil

From: Jacqueline Scoggins jscogg...@lbl.gov
Reply-To: slurm-dev slurm-dev@schedmd.com
Date: Tuesday, November 12, 2013 1:29 PM
To: slurm-dev slurm-dev@schedmd.com
Subject: [slurm-dev] Re: Admin reservation on busy nodes

Here is my problem, Phil. My node names are not like n00[00-91]; instead we have a suffix added to our hostnames, like n.jackie0. Since we have multiple n nodes, we had to add the cluster they are associated with to the FQDN. And when I tried nodes='n00[00-91].jackie0' I got a message that the node names were not valid. So I tried only the partition, and it still did not work.

Thanks
Jackie

On Tue, Nov 12, 2013 at 11:49 AM, Eckert, Phil ecke...@llnl.gov wrote:
Jackie,

I was trying this with an earlier version of SLURM. I just built a 2.5.7 test system and tried it again, and I am seeing the same failures that you do when any of the nodes in the partition are allocated. A workaround is to use the nodes= option, i.e.:

scontrol create reservation flags=ignore_jobs nodes=tnodes[32-591] starttime=now endtime=tomorrow partition=pbatch user=eckert

Phil Eckert
LLNL

From: Jacqueline Scoggins jscogg...@lbl.gov
Reply-To: slurm-dev slurm-dev@schedmd.com
Date: Tuesday, November 12, 2013 10:58 AM
To: slurm-dev slurm-dev@schedmd.com
Subject: [slurm-dev] Re: Admin reservation on busy nodes

I also believe I tried that one, as well as the other two, and each time I got the "nodes busy" message. If the nodes are in the alloc state, will either of these flags work? From what I saw, they would not work in this case.
Jackie

On Tue, Nov 12, 2013 at 9:34 AM, Eckert, Phil ecke...@llnl.gov wrote:
I see the "nodes busy" message only if I am trying to create a reservation on top of another reservation that includes the same nodes. You might try adding the OVERLAP flag if this is the case.

Phil Eckert
LLNL

From: Jacqueline Scoggins jscogg...@lbl.gov
Reply-To: slurm-dev slurm-dev@schedmd.com
Date: Tuesday, November 12, 2013 9:27 AM
To: slurm-dev slurm-dev@schedmd.com
Subject: [slurm-dev] Re: Admin reservation on busy nodes

I tried that and it stated that the nodes were busy.

Jackie

On Tue, Nov 12, 2013 at 9:16 AM, Paul Edmon ped...@cfa.harvard.edu wrote:
Include the ignore_jobs flag. That will force the reservation.

-Paul Edmon-

On 11/12/2013 12:11 PM, Jacqueline Scoggins wrote:
Running slurm 2.5.7; I tried to reserve the nodes of the cluster because of hardware issues that needed to be repaired. Some of the nodes were allocated with jobs and others were not. I tried to do the following but got an error that the nodes were busy and the reservation was not set:

scontrol create reservation flags=ignore_jobs,maint starttime=now endtime=yyyy-mm-ddThh:mm partition=blah

It would not work. Is there a way of setting system reservations on a partition even if there are running jobs allocated to nodes?

Thanks
Jackie
[slurm-dev] Re: Job count exceeds limit
I believe you have exceeded the MaxJobCount specified in your slurm.conf, or have reached the default of 10000 jobs.

MaxJobCount
    The maximum number of jobs SLURM can have in its active database at one time. Set the values of MaxJobCount and MinJobAge to ensure the slurmctld daemon does not exhaust its memory or other resources. Once this limit is reached, requests to submit additional jobs will fail. The default value is 10000 jobs. This value may not be reset via "scontrol reconfig". It only takes effect upon restart of the slurmctld daemon.

Phil Eckert
LLNL

On 8/9/13 9:08 AM, Mario Kadastik mario.kadas...@cern.ch wrote:

Hi,

lately we've started to see this:

[2013-08-09T18:57:12+03:00] error: create_job_record: job_count exceeds limit
[2013-08-09T18:57:13+03:00] error: create_job_record: job_count exceeds limit
[2013-08-09T18:57:16+03:00] error: create_job_record: job_count exceeds limit

and I can't quite understand where it comes from.

Mario Kadastik, PhD
Senior researcher

---
"Physics is like sex: sure, it may have practical reasons, but that's not why we do it" -- Richard P. Feynman
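The corresponding slurm.conf knobs, per the man-page excerpt above (the values are illustrative, not recommendations):

```shell
# slurm.conf fragment -- MaxJobCount changes require a slurmctld
# restart; "scontrol reconfig" is not enough.
MaxJobCount=50000   # active-database ceiling; size to slurmctld memory
MinJobAge=300       # seconds a completed job is kept before purging
```

Lowering MinJobAge drains completed jobs out of the active database faster, which is often the gentler fix when the limit is being hit by a burst of short jobs rather than by genuinely large queues.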
[slurm-dev] Re: Job submit plugin to improve backfill
Another route that could be taken is to set the DefaultTime for a partition to 0; the small patch attached to this email will then reject a job when it has no time limit specified and the default_time limit is 0. I also modified ESLURM_INVALID_TIME_LIMIT to include information that the error might be because of a missing time limit.

Phil Eckert
LLNL

On 6/28/13 7:29 AM, "Daniel M. Weeks" week...@rpi.edu wrote:

At CCNI, we use backfill scheduling on all our systems. However, we have found that users typically do not specify a time limit for their job, so the scheduler assumes the maximum from QoS/user limits/partition limits/etc. This really hurts backfilling, since the scheduler remains ignorant of short jobs.

Attached is a small patch I wrote containing a job submit plugin and a new error message. The plugin rejects a job submission when it is missing a time limit and provides the user with a clear and distinct error. I've just re-tested, and the patch applies and builds cleanly on the slurm-2.5, slurm-2.6, and master branches.

Please let me know if you find this useful, run across problems, or have suggestions/improvements. Thanks.

--
Daniel M. Weeks
Systems Programmer
Computational Center for Nanotechnology Innovations
Rensselaer Polytechnic Institute
Troy, NY 12180
518-276-4458
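With Phil's patch applied, the trigger is just partition configuration; something like the following (partition and node names are hypothetical):

```shell
# slurm.conf fragment: DefaultTime=0 means "no usable default", so --
# with the patch from this thread -- a job submitted without --time is
# rejected instead of silently inheriting the partition's MaxTime.
PartitionName=batch Nodes=tux[001-100] DefaultTime=0 MaxTime=24:00:00 Default=YES
```

Either way, the goal is the same as Daniel's plugin: force users to state a time limit so the backfill scheduler can tell short jobs from long ones.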
[slurm-dev] Re: fairshare usage
Have you looked at sshare?

Phil Eckert
LLNL

From: Mario Kadastik mario.kadas...@cern.ch
Reply-To: slurm-dev slurm-dev@schedmd.com
Date: Tuesday, January 22, 2013 11:17 AM
To: slurm-dev slurm-dev@schedmd.com
Subject: [slurm-dev] fairshare usage

Hi,

is there some decent way to get the multifactor fairshare current state? Something akin to maui's "diagnose -f" output that shows groups (accounts for slurm) and users with their fairshare target as well as their historic usage over the past N days. This would seriously help in understanding how the fairshare is computed based on the actual usage statistics and current cluster state. For example, we have all user fairshares set as parent, and for the accounts:

   Account     Share
---------- ---------
      root         1
      grid         1
  grid-ops         1
  hepusers       100
 kbfiusers         1

Now let's assume one of the users in hepusers spends the past N days computing with the full cluster, and then another user submits a number of jobs. It would be logical to assume that, as there is no distinction between the users in an account, the newcomer's priority would be higher, as (s)he hasn't had any allocated time.

[root@slurm-1 slurm]# sreport cluster accountutilizationbyuser start=2013-01-08
Cluster/Account/User Utilization 2013-01-08T00:00:00 - 2013-01-21T23:59:59 (1209600 secs)
Time reported in CPU Minutes
  Cluster    Account      Login     Proper Name      Used
--------- ---------- ---------- --------------- ---------
t2estonia       root                               7801048
t2estonia       grid                                     0
t2estonia       grid     cms134  mapped user fo+         0
t2estonia       grid  sgmcms000  mapped user fo+         0
t2estonia   hepusers                               7801048
t2estonia   hepusers     andres      Andres Tiko     85048
t2estonia   hepusers      mario   Mario Kadastik   7716000

So according to this, Mario (me) has computed a huge amount of time in comparison to andres.
However, if I look at the priorities from sprio -nl I see this:

[root@slurm-1 slurm]# sprio -nl | head -3
  JOBID   USER  PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION    QOS
  53498  mario    0.3497  0.2404977  0.4897101  0.9919238      1.000  0.000
  53499  mario    0.3497  0.2404977  0.4897101  0.9919238      1.000  0.000
[root@slurm-1 slurm]# sprio -nl | grep andres | head -1
  53835 andres    0.3497  0.2396412  0.4897101  0.9919238      1.000  0.000

So in fact the fairshare factor is equivalent for both users, no matter that one has been getting a lot of the resource while the other has not. Or do I misunderstand the =parent part? I also tried setting all users' shares to 1, and I have no clue how long it will take for sprio to recompute this, but right now it's showing the same priorities. That's one of the reasons why I'd like to be able to see how the actual usage and decay over time affect the factor, so that I can better understand the algorithm and tune the weights.

Thanks,

Mario Kadastik, PhD
Researcher

---
"Physics is like sex: sure, it may have practical reasons, but that's not why we do it" -- Richard P. Feynman
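On the "usage and decay over time" point: the multifactor plugin ages recorded usage with a configurable half-life (PriorityDecayHalfLife), so yesterday's CPU-minutes count for less than today's. A toy calculation of the decayed quantity, purely for intuition and not Slurm's actual code:

```shell
#!/bin/sh
# effective_usage = raw_usage * 0.5^(age / half_life)
decayed_usage () {
    # $1 = raw CPU-minutes, $2 = age, $3 = half-life (same time unit)
    awk -v u="$1" -v t="$2" -v h="$3" \
        'BEGIN { printf "%.0f\n", u * 0.5 ^ (t / h) }'
}

# With a 7-day half-life, usage from a week ago counts half:
# decayed_usage 7716000 7 7    -> 3858000
```

sshare shows the resulting effective usage and fairshare factor per account/user, which is the closest stock analogue to maui's "diagnose -f".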
[slurm-dev] Re: Problem submitting jobs from a non-compute node
I have scp'd it as moab.log.invalid.gz

On 12/11/12 1:00 PM, Moe Jette je...@schedmd.com wrote:

I would guess that your machine can communicate with the cluster's head node (where the slurmctld daemon executes and creates the job allocation), but not the compute nodes (where the slurmd daemons execute and spawn your tasks). It's probably a network issue.

Quoting Reza Ramazani-Rend r.ramaz...@gmail.com:

Hi,

I am trying to set up a machine for submitting jobs to a cluster that uses slurm. But when I try to submit a job, for example using the srun command, despite the job being allocated resources (for example, squeue shows the job running with the correct amount of resources allocated), it fails to run the application, and I have to terminate the srun process with a kill command on the local machine or use scancel to cancel the job and free the resources for other users. I tried to follow the instructions given on the mailing list for similar problems, and it seems that the machine that submits the job fails to receive signals from the compute node.

I am attaching the output from "scontrol show config", the srun command log (logsrunlocal, from "srun -v -p partitionname date 2>&1 | tee log"), and the output of strace (from "strace -r -f -o logfile srun …"). Other machines on the network with similar configurations can submit jobs without a problem. The log file from the "srun -v…" command does not indicate any problems that I could see until I terminate the job to free the resources (for comparison, logsrun301 is the log file from a successful run from one of the compute nodes). The strace log, however, shows that the client is waiting for a signal that it never receives (line 744, futex(0x4724ba4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, {1355239853, 0}, <unfinished ...>, and line 745, <... rt_sigtimedwait resumed> ) = 15).
The munge daemon is running on the client, and the permissions on all the directories and files are set up as instructed in the installation document. I also thought selinux might be blocking the communications, but disabling it didn't help. I was wondering if you can identify any problems that I have overlooked or if anything is wrong with the set-up. Thank you.
[slurm-dev] Re: Job name env var not set correctly
In the sbatch code, it checks to see whether a job name is provided; if so, it sets the SLURM_JOB_NAME environment variable. But since the overwrite argument of the call is 0, it does not do so if the variable is already set, which is the case you are running into once the first job is submitted:

	if (opt.job_name)
		setenv("SLURM_JOB_NAME", opt.job_name, 0);

Each successive job is using the already-set environment variable as the job name. One way to accomplish what you are seeking would be to unset the SLURM_JOB_NAME environment variable prior to making the sbatch call in the script.

On 10/9/12 9:10 AM, Carl Schmidtmann carl.schmidtm...@rochester.edu wrote:

We've run into a foible of how sbatch sets up the environment for scripts... SLURM_JOB_NAME is supposed to reflect the currently assigned name (-J or SBATCH_JOB_NAME), but if you queue up a script from within an executing queue script, SLURM does not overwrite SLURM_JOB_NAME with a new one, regardless of what SBATCH_JOB_NAME is set to or the -J option passed. For example:

#!/bin/bash
echo "My name is $SLURM_JOB_NAME"
if [ -e halt ] ; then rm halt ; exit 0 ; fi
sbatch --partition=debug --time=0:1:0 --dependency=afterany:$SLURM_JOB_ID --nodes=1 --job-name=bar /path/to/this/script
touch halt
sleep 10

If I queue this up with:

sbatch -p debug -t 0:1:0 --nodes 1 --job-name=foo script

the first output will say "My name is foo", but the second one will also say "My name is foo". It turns out that you can change SLURM_JOB_NAME and it will be propagated to the next queued script, so inserting:

SLURM_JOB_NAME=bar

before the sbatch command works as expected. The other oddity is that the name that shows up in squeue will change. Just not the env var...

-- Carl Schmidtmann Center for Integrated Research Computing University of Rochester
[slurm-dev] Re: Problem with quotes in sched/wiki2 plugin
According to Adaptive, this change was introduced in the 5_4 branch as of the .0 version, changeset 7922ced7105a79a3.

Phil Eckert LLNL

On 6/6/12 1:29 PM, Eckert, Phil ecke...@llnl.gov wrote:

In Moab 6.1 and later, the Moab wiki does filter out the quotes in the data it gets from SLURM. We currently use SLURM 2.3.3 and Moab 6.1 and see none of the issues that Jon is seeing. Looking through the Moab wiki code I found the change that does this, and I have a query in to Adaptive as to which release they first implemented it in. I will post the version when I hear back from them.

Phil Eckert LLNL

On 6/6/12 12:47 PM, Jon Bringhurst j...@lanl.gov wrote:

I think a good end result would be this:

* Use quotes in the wiki2 syntax to avoid the # issue.
* Have the Moab folks update their wiki specification to allow quotes, or at least find out why it doesn't already support quotes.
* Update the SLURM docs to replace "Use Moab version 5.0.0 or higher" with whatever version of Moab supports quotes in wiki.

In the meantime, I'm going to try to figure out which version of Moab we need to upgrade to for when we move to a SLURM newer than April 2012 on production clusters. It's probably overdue to make a push for upgrading from 5.3.5 anyway.

-Jon

On 06/06/2012 01:12 PM, Moe Jette wrote:

My recollection is that this change was made to address someone submitting a job in which the working directory contained a #. When Moab read job state information from SLURM, it interpreted the # as a job separator and could not parse anything after that point. Quoting the working directory name fixed the problem. The same problem could occur with several other fields that could contain a #. Removing this patch will restore this parsing problem.

Moe

Quoting Jon Bringhurst j...@lanl.gov:

I'd like to propose backing out a patch, as well as removing quotes from SUBMITHOST in wiki2.
http://bugs.schedmd.com/show_bug.cgi?id=29
https://github.com/SchedMD/slurm/commit/6cd20848dc3ed5375b637cbf34a6ba6af5fe9653

It's breaking several things when used with Moab 5.3.5, including classes and accounts. For example, we're getting this error:

NOTE: job violates constraints for partition slurm (partition slurm does not support requested class "standard")

Note that "standard" should be standard. Here's a patch to back it out as well as remove the quotes from SUBMITHOST:

diff --git a/src/plugins/sched/wiki2/get_jobs.c b/src/plugins/sched/wiki2/get_jobs.c
index 3b6153e..ec5d75b 100644
--- a/src/plugins/sched/wiki2/get_jobs.c
+++ b/src/plugins/sched/wiki2/get_jobs.c
@@ -326,7 +326,7 @@ static char * _dump_job(struct job_record *job_ptr, time_t update_time)
 	if (!IS_JOB_FINISHED(job_ptr) && job_ptr->details &&
 	    job_ptr->details->work_dir) {
-		snprintf(tmp, sizeof(tmp), "IWD=\"%s\";",
+		snprintf(tmp, sizeof(tmp), "IWD=%s;",
 			 job_ptr->details->work_dir);
 		xstrcat(buf, tmp);
 	}
@@ -335,17 +335,17 @@ static char * _dump_job(struct job_record *job_ptr, time_t update_time)
 		xstrcat(buf, "FLAGS=INTERACTIVE;");
 	if (job_ptr->gres) {
-		snprintf(tmp, sizeof(tmp), "GRES=\"%s\";", job_ptr->gres);
+		snprintf(tmp, sizeof(tmp), "GRES=%s;", job_ptr->gres);
 		xstrcat(buf, tmp);
 	}
 	if (job_ptr->resp_host) {
-		snprintf(tmp, sizeof(tmp), "SUBMITHOST=\"%s\";", job_ptr->resp_host);
+		snprintf(tmp, sizeof(tmp), "SUBMITHOST=%s;", job_ptr->resp_host);
 		xstrcat(buf, tmp);
 	}
 	if (job_ptr->wckey) {
-		snprintf(tmp, sizeof(tmp), "WCKEY=\"%s\";", job_ptr->wckey);
+		snprintf(tmp, sizeof(tmp), "WCKEY=%s;", job_ptr->wckey);
 		xstrcat(buf, tmp);
 	}
@@ -373,7 +373,7 @@ static char * _dump_job(struct job_record *job_ptr, time_t update_time)
 	else
 		pname = "UNKNOWN";	/* should never see this */
 	snprintf(tmp, sizeof(tmp),
-		 "QUEUETIME=%u;STARTTIME=%u;RCLASS=\"%s\";",
+		 "QUEUETIME=%u;STARTTIME=%u;RCLASS=%s;",
 		 _get_job_submit_time(job_ptr),
 		 (uint32_t) job_ptr->start_time, pname);
 	xstrcat(buf, tmp);
@@ -407,7 +407,7 @@ static char * _dump_job(struct job_record *job_ptr, time_t update_time)
 	if (job_ptr->account) {
 		snprintf(tmp, sizeof(tmp),
-			 "ACCOUNT=\"%s\";", job_ptr->account);
+			 "ACCOUNT=%s;", job_ptr->account);
 		xstrcat(buf, tmp);
 	}

-Jon
[slurm-dev] Re: Implementing soft limits and notifications with Slurm/Moab
Michael, I was curious, so I tried the

RESOURCELIMITPOLICY:ALWAYS,EXTENDEDVIOLATION:NOTIFY,CANCEL:12:00:00

parameter on my test cluster so that I could observe the behavior, and I also used the OverTimeLimit parameter on my SLURM test system. When the initial time limit is reached, I see that the job's remaining time in Moab goes negative. From what I've read, Torque supports both a hard and a soft limit, so when a job uses up its initial time, the time remaining reflects the extended value; with SLURM showing a negative value, there is at least an indication that the job is running on the extended time allotment. You are saying the job shows as cancelled after using its initial time, but I have found that if I use the Moab parameter

JOBMAXOVERRUN 12:00:00

in my moab.cfg, the job will stay in the system and showq will display it (with a negative time value) until completion.

Phil Eckert LLNL

On 6/5/12 8:33 AM, Michael Gutteridge michael.gutteri...@gmail.com wrote:

On Mon, Jun 4, 2012 at 1:48 PM, Lipari, Don lipa...@llnl.gov wrote:

What appears to be happening is that Moab is sending the canceljob message to SLURM when the job's time limit expires. It should email the user at that point, but hold off issuing the canceljob command to SLURM until Moab's EXTENDEDVIOLATION grace period - 12 hours in this case - has transpired.

I didn't go into this in detail, but it is SLURM that is issuing the cancel command to the job at the originally specified end time - which is why I originally set OverTimeLimit=UNLIMITED. Moab does not send the cancel command until it reaches EXTENDEDVIOLATION.

By setting SLURM's OverTimeLimit to match Moab's grace period, Michael has solved the problem.

What happens at that point is that the job's EndTime is set to the time at which EXTENDEDVIOLATION was reached. That's when the OverTimeLimit timer takes over - thus, SLURM won't cancel the job until StartTime + WallTime + EXTENDEDVIOLATION + OverTimeLimit.
It works, but Moab is confused about the job state after EXTENDEDVIOLATION (i.e., it thinks the job has been cancelled, but the RM reports it active). So yes, eventually this works, but it has undesirable side effects (e.g., the job isn't visible in showq, and I don't know how the resources would be scheduled).

If the above changes to Moab behavior are not made, I would recommend using SLURM's OverTimeLimit as Michael described. However, I don't see the need to eliminate the _timeout_job function from the wiki*/cancel_job.c modules.

What I've put together (but haven't tried out yet) is leaving the _timeout_job function as is, but adding the job-cancel code from _cancel_job. So it both sets EndTime (which I'm guessing might be good for accounting purposes) and cancels the job. It might be redundant, but likely harmless anyway.

Don

-----Original Message-----
From: Moe Jette [mailto:je...@schedmd.com]
Sent: Monday, June 04, 2012 11:29 AM
To: slurm-dev
Subject: [slurm-dev] Re: Implementing soft limits and notifications with Slurm/Moab

The code in question dates back about six years, to the first SLURM/Moab integration. I have no idea what the reason was for the different treatment of a job cancellation due to time limit versus an administrator cancellation. I can understand the problem caused by the current SLURM code and your configuration. It seems reasonable to remove the _timeout_job function and call the _cancel_job() function in all cases. If you want to validate that and respond to the list, we can change the SLURM code.

Quoting Michael Gutteridge michael.gutteri...@gmail.com:

I have kind of an interesting situation. We'd like to enable jobs to overrun their requested time by some amount, as well as provide notifications when that wall time is close to used up. We've got Moab Workload Manager (6.1.6) and Slurm 2.3.5 installed.
I'd originally attempted to use Moab's resource limit policy:

RESOURCELIMITPOLICY:ALWAYS,EXTENDEDVIOLATION:NOTIFY,CANCEL:12:00:00

meaning that when the job goes over time, Moab notifies the user but then cancels the job after it has gone 12 hours past its wall time. Now, this initially didn't work - Slurm just kills the job at its original limit. So I set OverTimeLimit=UNLIMITED, and then I got the notifications OK... but when the job reaches its overtime limit, the job isn't actually cancelled. Moab does issue the cancel; I see it send Slurm the message via wiki2:

05/31 11:15:31 INFO: message sent: 'CMD=CANCELJOB ARG=1508 TYPE=WALLCLOCK'

And I see Slurm acknowledge the event:

112785 05/31 11:15:31 INFO: received message 'CK=8512712decedc584 TS=1338488131 AUTH=slurm DT=SC=0 RESPONSE=job 1508 cancelled successfully' from wiki server
112786 05/31 11:15:31 MSUDisconnect(9)
112787 05/31 11:15:31 INFO: job '1508' cancelled through WIKI RM

At higher log levels I see that Slurm sets the end time for the job to the current time. In
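Pulling the thread's working combination together, the setup can be sketched as a pair of config fragments. This is a sketch under the assumptions stated in the thread: the RESOURCELIMITPOLICY and JOBMAXOVERRUN lines are quoted as given above, and the 720-minute OverTimeLimit is a hypothetical value chosen to match Moab's 12-hour grace period (OverTimeLimit takes minutes, or UNLIMITED; verify against your slurm.conf man page and Moab version):

```
# slurm.conf: allow jobs to run past their limit instead of being
# killed at EndTime; 720 minutes = the 12-hour grace period below.
OverTimeLimit=720

# moab.cfg: notify the user at the soft limit, cancel 12 hours later,
# and keep the overrun job visible in showq until then.
RESOURCELIMITPOLICY:ALWAYS,EXTENDEDVIOLATION:NOTIFY,CANCEL:12:00:00
JOBMAXOVERRUN 12:00:00
```

The key point from Don's explanation is that the two timers are additive: SLURM will not cancel the job until StartTime + WallTime + EXTENDEDVIOLATION + OverTimeLimit, so keeping the two grace values equal keeps the schedulers' views of the job roughly in sync.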