[slurm-dev] Re: Newb question about plugins
Update: In the interest of helping out anyone else in the future who may have my problem, I'm posting the solution. All I had to do was add the line

JobSubmitPlugins=job_submit/require_timelimit

to slurm.conf. It would have saved so much time and trouble if this syntax were documented in the slurm.conf man page, but I couldn't find anything anywhere on how to properly use it. I just had to go through a lot of trial and error.

On Thu, Mar 19, 2015 at 2:37 PM, Michael Kit Gilbert m...@nau.edu wrote:

Thanks again for the help!

OS: CentOS 6.5
Slurm version: 14.11.2
Compiler: gcc 4.4.7

Getting rid of the plugstack.conf file allowed me to start running jobs again, but the plugin that I'm wanting to work doesn't appear to be enabled. There are a bunch of *.so plugin files in the /usr/lib64/slurm directory. One of them is job_submit_require_timelimit.so. I assume that since these are installed here that they were compiled when slurm was installed. So since they're in this directory, how do I enable them? I want people to be forced to enter a time limit and that is what this plugin appears to do.

On Thu, Mar 19, 2015 at 1:46 PM, Andy Riebs andy.ri...@hp.com wrote:

OK, we (or at least I) have reached the point where you need to provide some more information:

* What operating system and version?
* What Slurm version?
* What compiler?

You apparently have some kind of build problem, as Slurm plugins are required to export a specific set of symbols; they seem not to be exported in your plugin.

Have you gotten Slurm to run without the plugin? That's a useful first step before adding anything that is optional.

(BTW, did you discover that MailProg is a requirement, once you get further down the road?)

Andy

On 03/19/2015 04:33 PM, Michael Kit Gilbert wrote:

Update: So, I have figured out the problem with slurm not running properly. It had to do with my fstab file being incorrect and not mounting /var/spool correctly. Now I can start slurm correctly.

However, when trying to run a job, slurm doesn't load the plugin properly, so it fails with the following message:

sbatch: error: spank: /usr/lib64/slurm/job_submit_require_timelimit.so exports 0 symbols
sbatch: error: spank: /etc/slurm/plugstack.conf:7: Failed to load plugin /usr/lib64/slurm/job_submit_require_timelimit.so. Aborting.
sbatch: error: Failed to initialize plugin stack

I posted the slurm.conf and plugstack.conf changes I made in the first post. Thanks for any help!

On Thu, Mar 19, 2015 at 11:55 AM, Michael Kit Gilbert m...@nau.edu wrote:

Thank you so much for the reply, Andy. Well, apparently there's a lot happening that may be causing the issue. First, I can't seem to get slurmctld running properly. When I run slurmctld -D, this is my output:

slurmctld: error: Can't save state, create file /var/spool/slurm/last_config_lite.new error Permission denied
slurmctld: error: Configured MailProg is invalid
slurmctld: Job accounting information stored, but details not gathered
slurmctld: fatal: Incorrect permissions on state save loc: /var/spool/slurm

I have the MailProg line in slurm.conf commented out, so does it have to be specified to work? Also, since I'm root and root is the owner of the /var/spool/slurm directory, I'm not sure why it's telling me the permissions are incorrect...

On Thu, Mar 19, 2015 at 10:20 AM, Andy Riebs andy.ri...@hp.com wrote:

Michael,

Try running slurmctld -D which should result in output telling you what's going wrong.

Andy

On 03/19/2015 01:15 PM, Michael Kit Gilbert wrote:

Sorry for the basic question, but I am new to slurm and am having some basic problems with plugins. What I'd like to do is make the job_submit_require_timelimit.so plugin that is found in the source code active and required for all jobs.

What I've done so far is I've added the line *PluginDir=/usr/lib64/slurm* to slurm.conf and I've created a plugstack.conf file that has one line in it: *required job_submit_require_timelimit.so* And now slurm won't start at all. So obviously I've made a huge newbie error. I've verified that our plugins are found in the /usr/lib64/slurm directory, but I can't tell what else I need to do. Does this plugin require arguments? Is there something else I'm missing?

Thanks,
Mike
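
For anyone who wants to double-check the setting from a script, here is a minimal sketch in Python. It assumes slurm.conf consists of plain Key=Value lines with # comments; the helper name job_submit_plugins and the sample config text are made up for illustration and are not part of Slurm.

```python
def job_submit_plugins(conf_text):
    """Return the plugins named on the JobSubmitPlugins line of a slurm.conf.

    Assumption: simple Key=Value lines, '#' starts a comment.
    Keys are compared case-insensitively in this sketch.
    """
    plugins = []
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip().lower() == "jobsubmitplugins":
            plugins = [p.strip() for p in value.split(",") if p.strip()]
    return plugins


# Hypothetical slurm.conf excerpt matching the fix described in this thread.
SAMPLE_CONF = """\
# MailProg=/bin/mail
PluginDir=/usr/lib64/slurm
JobSubmitPlugins=job_submit/require_timelimit
"""

if __name__ == "__main__":
    print(job_submit_plugins(SAMPLE_CONF))
```

The "spank:" prefix in the sbatch errors suggests why the first attempt failed: plugstack.conf is read by the SPANK plugin stack, and a job_submit plugin is not a SPANK plugin, so it belongs in slurm.conf instead.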
[slurm-dev] Re: Slurm is refusing to establish a connection between nodes and controller
Felix,

How does the routing table look on the controller? Is the IB network listed on the controller using the correct interface?

John DeSantis

2015-03-19 10:48 GMT-04:00 Felix Willenborg felix.willenb...@uni-oldenburg.de:

So I tried installing the latest package (14.11.4-1) of slurm, unfortunately with no success. I made sure that the Infiniband plugin was compiled, that it is loaded in slurmd, and that an acct_gathering.conf is available. Still, I have the same problem. I assume that I'm not configuring slurm correctly with regard to Infiniband. Where could I be making mistakes?
[slurm-dev] Re: Newb question about plugins
Michael,

Try running slurmctld -D which should result in output telling you what's going wrong.

Andy

On 03/19/2015 01:15 PM, Michael Kit Gilbert wrote:

Sorry for the basic question, but I am new to slurm and am having some basic problems with plugins. What I'd like to do is make the job_submit_require_timelimit.so plugin that is found in the source code active and required for all jobs.

What I've done so far is I've added the line *PluginDir=/usr/lib64/slurm* to slurm.conf and I've created a plugstack.conf file that has one line in it: *required job_submit_require_timelimit.so* And now slurm won't start at all. So obviously I've made a huge newbie error. I've verified that our plugins are found in the /usr/lib64/slurm directory, but I can't tell what else I need to do. Does this plugin require arguments? Is there something else I'm missing?

Thanks,
Mike
[slurm-dev] Re: Slurm is refusing to establish a connection between nodes and controller
So I tried installing the latest package (14.11.4-1) of slurm, unfortunately with no success. I made sure that the Infiniband plugin was compiled, that it is loaded in slurmd, and that an acct_gathering.conf is available. Still, I have the same problem. I assume that I'm not configuring slurm correctly with regard to Infiniband. Where could I be making mistakes?
[slurm-dev] Re: Newb question about plugins
Thank you so much for the reply, Andy. Well, apparently there's a lot happening that may be causing the issue. First, I can't seem to get slurmctld running properly. When I run slurmctld -D, this is my output:

slurmctld: error: Can't save state, create file /var/spool/slurm/last_config_lite.new error Permission denied
slurmctld: error: Configured MailProg is invalid
slurmctld: Job accounting information stored, but details not gathered
slurmctld: fatal: Incorrect permissions on state save loc: /var/spool/slurm

I have the MailProg line in slurm.conf commented out, so does it have to be specified to work? Also, since I'm root and root is the owner of the /var/spool/slurm directory, I'm not sure why it's telling me the permissions are incorrect...

On Thu, Mar 19, 2015 at 10:20 AM, Andy Riebs andy.ri...@hp.com wrote:

Michael,

Try running slurmctld -D which should result in output telling you what's going wrong.

Andy

On 03/19/2015 01:15 PM, Michael Kit Gilbert wrote:

Sorry for the basic question, but I am new to slurm and am having some basic problems with plugins. What I'd like to do is make the job_submit_require_timelimit.so plugin that is found in the source code active and required for all jobs.

What I've done so far is I've added the line *PluginDir=/usr/lib64/slurm* to slurm.conf and I've created a plugstack.conf file that has one line in it: *required job_submit_require_timelimit.so* And now slurm won't start at all. So obviously I've made a huge newbie error. I've verified that our plugins are found in the /usr/lib64/slurm directory, but I can't tell what else I need to do. Does this plugin require arguments? Is there something else I'm missing?

Thanks,
Mike
[slurm-dev] Re: Newb question about plugins
Update: So, I have figured out the problem with slurm not running properly. It had to do with my fstab file being incorrect and not mounting /var/spool correctly. Now I can start slurm correctly.

However, when trying to run a job, slurm doesn't load the plugin properly, so it fails with the following message:

sbatch: error: spank: /usr/lib64/slurm/job_submit_require_timelimit.so exports 0 symbols
sbatch: error: spank: /etc/slurm/plugstack.conf:7: Failed to load plugin /usr/lib64/slurm/job_submit_require_timelimit.so. Aborting.
sbatch: error: Failed to initialize plugin stack

I posted the slurm.conf and plugstack.conf changes I made in the first post. Thanks for any help!

On Thu, Mar 19, 2015 at 11:55 AM, Michael Kit Gilbert m...@nau.edu wrote:

Thank you so much for the reply, Andy. Well, apparently there's a lot happening that may be causing the issue. First, I can't seem to get slurmctld running properly. When I run slurmctld -D, this is my output:

slurmctld: error: Can't save state, create file /var/spool/slurm/last_config_lite.new error Permission denied
slurmctld: error: Configured MailProg is invalid
slurmctld: Job accounting information stored, but details not gathered
slurmctld: fatal: Incorrect permissions on state save loc: /var/spool/slurm

I have the MailProg line in slurm.conf commented out, so does it have to be specified to work? Also, since I'm root and root is the owner of the /var/spool/slurm directory, I'm not sure why it's telling me the permissions are incorrect...

On Thu, Mar 19, 2015 at 10:20 AM, Andy Riebs andy.ri...@hp.com wrote:

Michael,

Try running slurmctld -D which should result in output telling you what's going wrong.

Andy

On 03/19/2015 01:15 PM, Michael Kit Gilbert wrote:

Sorry for the basic question, but I am new to slurm and am having some basic problems with plugins. What I'd like to do is make the job_submit_require_timelimit.so plugin that is found in the source code active and required for all jobs.

What I've done so far is I've added the line *PluginDir=/usr/lib64/slurm* to slurm.conf and I've created a plugstack.conf file that has one line in it: *required job_submit_require_timelimit.so* And now slurm won't start at all. So obviously I've made a huge newbie error. I've verified that our plugins are found in the /usr/lib64/slurm directory, but I can't tell what else I need to do. Does this plugin require arguments? Is there something else I'm missing?

Thanks,
Mike
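
The "Permission denied" and "Incorrect permissions on state save loc" errors quoted above turned out to come from a mount problem rather than from ownership. A minimal Python sketch for gathering the relevant facts about such a directory follows. The thread does not say what slurmctld itself checks, so this is a generic diagnostic under that caveat, and the helper name describe_dir is made up for illustration.

```python
import os
import stat
import tempfile


def describe_dir(path):
    """Collect facts about a directory that help explain a
    'permission denied' report: type, owner, mode, mount status."""
    st = os.stat(path)
    parent = os.path.dirname(os.path.abspath(path))
    return {
        "is_dir": stat.S_ISDIR(st.st_mode),
        "owner_uid": st.st_uid,
        "mode": oct(stat.S_IMODE(st.st_mode)),
        # Can the current user create files here?
        "writable_by_me": os.access(path, os.W_OK | os.X_OK),
        # True when the path itself is a mount point.
        "is_mount": os.path.ismount(path),
        # False hints that a separate filesystem is mounted at this path.
        "same_device_as_parent": st.st_dev == os.stat(parent).st_dev,
    }


if __name__ == "__main__":
    # Demo on a throwaway directory; on a real system one would pass
    # the state save location, e.g. /var/spool/slurm from this thread.
    with tempfile.TemporaryDirectory() as d:
        for key, value in describe_dir(d).items():
            print(key, value)
```

Comparing this output before and after fixing fstab would show whether the directory the daemon sees is the one that was expected to be mounted.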
[slurm-dev] Re: Newb question about plugins
OK, we (or at least I) have reached the point where you need to provide some more information:

* What operating system and version?
* What Slurm version?
* What compiler?

You apparently have some kind of build problem, as Slurm plugins are required to export a specific set of symbols; they seem not to be exported in your plugin.

Have you gotten Slurm to run without the plugin? That's a useful first step before adding anything that is optional.

(BTW, did you discover that MailProg is a requirement, once you get further down the road?)

Andy

On 03/19/2015 04:33 PM, Michael Kit Gilbert wrote:

Update: So, I have figured out the problem with slurm not running properly. It had to do with my fstab file being incorrect and not mounting /var/spool correctly. Now I can start slurm correctly.

However, when trying to run a job, slurm doesn't load the plugin properly, so it fails with the following message:

sbatch: error: spank: /usr/lib64/slurm/job_submit_require_timelimit.so exports 0 symbols
sbatch: error: spank: /etc/slurm/plugstack.conf:7: Failed to load plugin /usr/lib64/slurm/job_submit_require_timelimit.so. Aborting.
sbatch: error: Failed to initialize plugin stack

I posted the slurm.conf and plugstack.conf changes I made in the first post. Thanks for any help!

On Thu, Mar 19, 2015 at 11:55 AM, Michael Kit Gilbert m...@nau.edu wrote:

Thank you so much for the reply, Andy. Well, apparently there's a lot happening that may be causing the issue. First, I can't seem to get slurmctld running properly. When I run slurmctld -D, this is my output:

slurmctld: error: Can't save state, create file /var/spool/slurm/last_config_lite.new error Permission denied
slurmctld: error: Configured MailProg is invalid
slurmctld: Job accounting information stored, but details not gathered
slurmctld: fatal: Incorrect permissions on state save loc: /var/spool/slurm

I have the MailProg line in slurm.conf commented out, so does it have to be specified to work? Also, since I'm root and root is the owner of the /var/spool/slurm directory, I'm not sure why it's telling me the permissions are incorrect...

On Thu, Mar 19, 2015 at 10:20 AM, Andy Riebs andy.ri...@hp.com wrote:

Michael,

Try running slurmctld -D which should result in output telling you what's going wrong.

Andy

On 03/19/2015 01:15 PM, Michael Kit Gilbert wrote:

Sorry for the basic question, but I am new to slurm and am having some basic problems with plugins. What I'd like to do is make the job_submit_require_timelimit.so plugin that is found in the source code active and required for all jobs.

What I've done so far is I've added the line *PluginDir=/usr/lib64/slurm* to slurm.conf and I've created a plugstack.conf file that has one line in it: *required job_submit_require_timelimit.so* And now slurm won't start at all. So obviously I've made a huge newbie error. I've verified that our plugins are found in the /usr/lib64/slurm directory, but I can't tell what else I need to do. Does this plugin require arguments? Is there something else I'm missing?

Thanks,
Mike
[slurm-dev] Slurm versions 14.11.5 and 15.08.0-pre3 are now available
Version 14.11.5 contains quite a few bug fixes generated over the past five weeks, including fixes for two high-impact bugs. There is a fix for the slurmdbd daemon aborting if a node is set to a DOWN state and its reason field is NULL. The other important bug fix will prevent someone from being able to kill a job array belonging to another user. Details about all of the changes are appended.

Version 15.08.0-pre3 represents the current state of Slurm development for the release planned in August 2015 and is intended for development and test purposes only. Notable enhancements include power capping support for Cray systems and the ability for a compute node to be allocated to multiple jobs, but restricted to one user at a time.

Both versions can be downloaded from http://www.schedmd.com/#repos

* Changes in Slurm 14.11.5
==========================
 -- Correct the squeue command taking into account that a node can have a NULL name if it is not in DNS but still in slurm.conf.
 -- Fix slurmdbd regression which would cause a segfault when a node is set down with no reason.
 -- BGQ - Fix issue with job arrays not being handled correctly in the runjob_mux plugin.
 -- Print FAIR_TREE, if configured, in scontrol show config output for PriorityFlags.
 -- Add SLURM_JOB_GPUS environment variable to those available in the Prolog.
 -- Load lua-5.2 library if using lua5.2 for lua job submit plugin.
 -- GRES logic: Prevent bad node_offset due to not preserving no_consume flag.
 -- Fix wrong variables used in the wrapper functions needed for systems that don't support strong_alias.
 -- Fix code for Apple computers, where SOL_TCP is not defined.
 -- Cray/BASIL - Check for mysql credentials in /root/.my.cnf.
 -- Fix sprio showing wrong priority for job arrays until priority is recalculated.
 -- Account to batch step all CPUs that are allocated to a job, not just one, since the batch step has access to all CPUs like other steps.
 -- Fix job getting EligibleTime set before meeting dependency requirements.
 -- Correct the initialization of QOS MinCPUs per job limit.
 -- Set the debug level of information messages in cgroup plugin to debug2.
 -- For job running under a debugger, if the exec of the task fails, then cancel its I/O and abort immediately rather than waiting 60 seconds for I/O timeout.
 -- Fix associations not getting default qos set until after a restart.
 -- Set the value of total_cpus not to be zero before invoking acct_policy_job_runnable_post_select.
 -- MySQL - When requesting cluster resources, only return resources for the cluster(s) requested.
 -- Add TaskPluginParam=autobind=threads option to set a default binding in the case that auto binding doesn't find a match.
 -- Introduce a new SchedulerParameters variable nohold_on_prolog_fail. If configured, don't requeue jobs on hold if a Prolog fails.
 -- Make it so sched_params isn't read over and over when an epilog complete message comes in.
 -- Fix squeue -L licenses not filtering out jobs with licenses.
 -- Changed the implementation of xcpuinfo_abs_to_mac() to be identical to _abs_to_mac() to fix CPUs allocation using cpuset cgroup.
 -- Improve the explanation of the unbuffered feature in the srun man page.
 -- Make taskplugin=cgroup work for core spec. needed to have task/cgroup before.
 -- Fix reports not using the month usage table.
 -- BGQ - Sanity check given for translating small blocks into slurm bg_records.
 -- Fix bug preventing the requeue/hold or requeue/special_exit of job from the completing state.
 -- Cray - Fix for launching batch step within an existing job allocation.
 -- Cray - Add ALPS_APP_ID_ENV environment variable.
 -- Increase maximum MaxArraySize configuration parameter value from 1,000,001 to 4,000,001.
 -- Added new SchedulerParameters value of bf_min_age_reserve. The backfill scheduler will not reserve resources for pending jobs until they have been pending for at least the specified number of seconds. This can be valuable if jobs lack time limits or all time limits have the same value.
 -- Fix support for --mem=0 (all memory of a node) with select/cons_res plugin.
 -- Fix bug that can permit someone to kill job array belonging to another user.
 -- Don't set the default partition on a license only reservation.
 -- Show a NodeCnt=0, instead of NO_VAL, in scontrol show res for a license only reservation.
 -- BGQ - When using static small blocks make sure when clearing the job the block is set up to its original state.
 -- Start job allocation using lowest numbered sockets for block task distribution for consistency with cyclic distribution.

* Changes in Slurm 15.08.0pre3
==============================
 -- CRAY - addition of acct_gather_energy/cray plugin.
 -- Add job credential to Run Prolog RPC used with a configuration of PrologFlags=alloc. This allows the Prolog to be passed identification of GPUs allocated to
[slurm-dev] Re: Newb question about plugins
Thanks again for the help!

OS: CentOS 6.5
Slurm version: 14.11.2
Compiler: gcc 4.4.7

Getting rid of the plugstack.conf file allowed me to start running jobs again, but the plugin that I'm wanting to work doesn't appear to be enabled. There are a bunch of *.so plugin files in the /usr/lib64/slurm directory. One of them is job_submit_require_timelimit.so. I assume that since these are installed here that they were compiled when slurm was installed. So since they're in this directory, how do I enable them? I want people to be forced to enter a time limit and that is what this plugin appears to do.

On Thu, Mar 19, 2015 at 1:46 PM, Andy Riebs andy.ri...@hp.com wrote:

OK, we (or at least I) have reached the point where you need to provide some more information:

* What operating system and version?
* What Slurm version?
* What compiler?

You apparently have some kind of build problem, as Slurm plugins are required to export a specific set of symbols; they seem not to be exported in your plugin.

Have you gotten Slurm to run without the plugin? That's a useful first step before adding anything that is optional.

(BTW, did you discover that MailProg is a requirement, once you get further down the road?)

Andy

On 03/19/2015 04:33 PM, Michael Kit Gilbert wrote:

Update: So, I have figured out the problem with slurm not running properly. It had to do with my fstab file being incorrect and not mounting /var/spool correctly. Now I can start slurm correctly.

However, when trying to run a job, slurm doesn't load the plugin properly, so it fails with the following message:

sbatch: error: spank: /usr/lib64/slurm/job_submit_require_timelimit.so exports 0 symbols
sbatch: error: spank: /etc/slurm/plugstack.conf:7: Failed to load plugin /usr/lib64/slurm/job_submit_require_timelimit.so. Aborting.
sbatch: error: Failed to initialize plugin stack

I posted the slurm.conf and plugstack.conf changes I made in the first post. Thanks for any help!

On Thu, Mar 19, 2015 at 11:55 AM, Michael Kit Gilbert m...@nau.edu wrote:

Thank you so much for the reply, Andy. Well, apparently there's a lot happening that may be causing the issue. First, I can't seem to get slurmctld running properly. When I run slurmctld -D, this is my output:

slurmctld: error: Can't save state, create file /var/spool/slurm/last_config_lite.new error Permission denied
slurmctld: error: Configured MailProg is invalid
slurmctld: Job accounting information stored, but details not gathered
slurmctld: fatal: Incorrect permissions on state save loc: /var/spool/slurm

I have the MailProg line in slurm.conf commented out, so does it have to be specified to work? Also, since I'm root and root is the owner of the /var/spool/slurm directory, I'm not sure why it's telling me the permissions are incorrect...

On Thu, Mar 19, 2015 at 10:20 AM, Andy Riebs andy.ri...@hp.com wrote:

Michael,

Try running slurmctld -D which should result in output telling you what's going wrong.

Andy

On 03/19/2015 01:15 PM, Michael Kit Gilbert wrote:

Sorry for the basic question, but I am new to slurm and am having some basic problems with plugins. What I'd like to do is make the job_submit_require_timelimit.so plugin that is found in the source code active and required for all jobs.

What I've done so far is I've added the line *PluginDir=/usr/lib64/slurm* to slurm.conf and I've created a plugstack.conf file that has one line in it: *required job_submit_require_timelimit.so* And now slurm won't start at all. So obviously I've made a huge newbie error. I've verified that our plugins are found in the /usr/lib64/slurm directory, but I can't tell what else I need to do. Does this plugin require arguments? Is there something else I'm missing?

Thanks,
Mike
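
On the question of how an installed .so file relates to a setting in slurm.conf: in this thread the file job_submit_require_timelimit.so ended up being enabled as job_submit/require_timelimit. A minimal Python sketch of that filename-to-name mapping follows. Treating it as a general rule for every plugin type is an assumption, and the helper name plugin_name_from_file is made up for illustration.

```python
import os


def plugin_name_from_file(path, plugin_type="job_submit"):
    """Map '<type>_<name>.so' to '<type>/<name>'.

    The plugin type must be supplied because type names can themselves
    contain underscores (as 'job_submit' does), so the filename alone
    is ambiguous. Returns None if the file does not match the type.
    """
    base = os.path.basename(path)
    prefix = plugin_type + "_"
    suffix = ".so"
    if not (base.startswith(prefix) and base.endswith(suffix)):
        return None
    return plugin_type + "/" + base[len(prefix):-len(suffix)]


if __name__ == "__main__":
    name = plugin_name_from_file(
        "/usr/lib64/slurm/job_submit_require_timelimit.so")
    # The resulting name is what goes on the JobSubmitPlugins line.
    print("JobSubmitPlugins=" + str(name))
```

Having the file in PluginDir only makes the plugin available; it still has to be named in the matching slurm.conf parameter before it is used.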