[slurm-dev] Re: Newb question about plugins

2015-03-19 Thread Michael Kit Gilbert
Update: In the interest of helping out anyone else in the future who may
have my problem, I'm posting the solution to the problem.

All I had to do was add the line

 JobSubmitPlugins=job_submit/require_timelimit

to the slurm.conf. It would have saved so much time and trouble if this
syntax was documented in the slurm.conf man page, but I couldn't find
anything anywhere on how to properly use it. Just had to go through a lot
of trial and error.
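
For anyone who wants to confirm the setting is actually present, here is a minimal sketch in Python. It is not part of Slurm: the function name and the sample text are made up for illustration, and it assumes one setting per line with keys matched case-insensitively.

```python
def job_submit_plugins(conf_text):
    """Return the plugin names found on JobSubmitPlugins lines."""
    plugins = []
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip().lower() == "jobsubmitplugins":
            # The value is treated as a comma-separated list of plugins.
            plugins.extend(p.strip() for p in value.split(",") if p.strip())
    return plugins


# Hypothetical excerpt of a slurm.conf, using the line from this post.
SAMPLE_CONF = """\
PluginDir=/usr/lib64/slurm
JobSubmitPlugins=job_submit/require_timelimit
"""
```

Calling job_submit_plugins() on the real file's contents should list job_submit/require_timelimit once the line has been added; a commented-out line is ignored.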

On Thu, Mar 19, 2015 at 2:37 PM, Michael Kit Gilbert m...@nau.edu wrote:

 Thanks again for the help!
 OS: CentOS 6.5
 Slurm version: 14.11.2
 Compiler: gcc 4.4.7

 Getting rid of the plugstack.conf file allowed me to start running jobs
 again, but the plugin that I'm wanting to work doesn't appear to be enabled.

 There are a bunch of *.so plugin files in the /usr/lib64/slurm directory.
 One of them is job_submit_require_timelimit.so. I assume that since these
 are installed here that they were compiled when slurm was installed. So
 since they're in this directory, how do I enable them? I want people to be
 forced to enter a time limit and that is what this plugin appears to do.

 On Thu, Mar 19, 2015 at 1:46 PM, Andy Riebs andy.ri...@hp.com wrote:

  OK, we (or at least I) have reached the point where you need to provide
 some more information:
 * What operating system and version?
 * What Slurm version?
 * What compiler?

 You apparently have some kind of build problem, as Slurm plugins are
 required to export a specific set of symbols; they seem not to be exported
 in your plugin.

 Have you gotten Slurm to run without the plugin? That's a useful first
 step before adding anything that is optional. (BTW, did you discover that
 MailProg is a requirement, once you get further down the road?)

 Andy


 On 03/19/2015 04:33 PM, Michael Kit Gilbert wrote:

 Update: So, I have figured out the problem with slurm not running
 properly. It had to do with my fstab file being incorrect and not mounting
 /var/spool correctly.

  Now I can start slurm correctly. However, when trying to run a job,
 slurm doesn't load the plugin properly, so it fails with the following
 message:

  sbatch: error: spank:
 /usr/lib64/slurm/job_submit_require_timelimit.so exports 0 symbols
 sbatch: error: spank: /etc/slurm/plugstack.conf:7: Failed to load plugin
 /usr/lib64/slurm/job_submit_require_timelimit.so. Aborting.
 sbatch: error: Failed to initialize plugin stack

  I posted the slurm.conf and plugstack.conf changes I made in the first
 post. Thanks for any help!

 On Thu, Mar 19, 2015 at 11:55 AM, Michael Kit Gilbert m...@nau.edu
 wrote:

  Thank you so much for the reply, Andy. Well, apparently there's a lot
 happening that may be causing the issue. First, I can't seem to get
 slurmctld running properly. When I run slurmctld -D, this is my output:

  slurmctld: error: Can't save state, create file
 /var/spool/slurm/last_config_lite.new error Permission denied
 slurmctld: error: Configured MailProg is invalid
 slurmctld: Job accounting information stored, but details not gathered
 slurmctld: fatal: Incorrect permissions on state save loc:
 /var/spool/slurm

  I have the MailProg line in slurm.conf commented out, so does it have
 to be specified to work? Also, since I'm root and root is the owner of the
 /var/spool/slurm directory, I'm not sure why it's telling me the
 permissions are incorrect...

 On Thu, Mar 19, 2015 at 10:20 AM, Andy Riebs andy.ri...@hp.com wrote:

  Michael,

 Try running slurmctld -D which should result in output telling you
 what's going wrong.

 Andy



 On 03/19/2015 01:15 PM, Michael Kit Gilbert wrote:

 Sorry for the basic question, but I am new to slurm and am having some
 basic problems with plugins. What I'd like to do is make the
 job_submit_require_timelimit.so plugin that is found in the source code
 active and required for all jobs.

  What I've done so far is I've added the line

  *PluginDir=/usr/lib64/slurm*

  to slurm.conf and I've created a plugstack.conf file that has one
 line in it:

  *required        job_submit_require_timelimit.so*

  And now slurm won't start at all. So obviously I've made a huge
 newbie error. I've verified that our plugins are found in the
 /usr/lib64/slurm directory, but I can't tell what else I need to do.
 Does this plugin require arguments? Is there something else I'm missing?

  Thanks,

  Mike









[slurm-dev] Re: Slurm is refusing to establish a connection between nodes and controller

2015-03-19 Thread John Desantis

Felix,

How does the routing table look on the controller?

Is the IB network listed on the controller using the correct interface?

John DeSantis

2015-03-19 10:48 GMT-04:00 Felix Willenborg felix.willenb...@uni-oldenburg.de:

 So i tried out installing the latest package (14.11.4-1) of slurm with no
 success - unfortunately. I kept an eye on the compilation of the Infiniband
 Plugin, that it is loaded in the slurmd and that a acct_gathering.conf is
 available. Still, i have the same problem. I assume that i'm not configuring
 slurm correctly with regard to Infiniband. Are there possibilities where i
 can make any mistakes?


[slurm-dev] Re: Newb question about plugins

2015-03-19 Thread Andy Riebs
   Michael,
 
 Try running slurmctld -D which should result in output telling you
 what's going wrong.
 
 Andy

 On 03/19/2015 01:15 PM, Michael Kit Gilbert wrote:

 Sorry for the basic question, but I am new to slurm and am having some
 basic problems with plugins. What I'd like to do is make the
 job_submit_require_timelimit.so plugin that is found in the source code
 active and required for all jobs.

 What I've done so far is I've added the line

 *PluginDir=/usr/lib64/slurm*

 to slurm.conf and I've created a plugstack.conf file that has one line
 in it:

 *required        job_submit_require_timelimit.so*

 And now slurm won't start at all. So obviously I've made a huge newbie
 error. I've verified that our plugins are found in the /usr/lib64/slurm
 directory, but I can't tell what else I need to do.
 Does this plugin require arguments? Is there something else I'm missing?

 Thanks,

 Mike


[slurm-dev] Re: Slurm is refusing to establish a connection between nodes and controller

2015-03-19 Thread Felix Willenborg


So i tried out installing the latest package (14.11.4-1) of slurm with 
no success - unfortunately. I kept an eye on the compilation of the 
Infiniband Plugin, that it is loaded in the slurmd and that a 
acct_gathering.conf is available. Still, i have the same problem. I 
assume that i'm not configuring slurm correctly with regard to 
Infiniband. Are there possibilities where i can make any mistakes?


[slurm-dev] Re: Newb question about plugins

2015-03-19 Thread Michael Kit Gilbert
Thank you so much for the reply, Andy. Well, apparently there's a lot
happening that may be causing the issue. First, I can't seem to get
slurmctld running properly. When I run slurmctld -D, this is my output:

slurmctld: error: Can't save state, create file
/var/spool/slurm/last_config_lite.new error Permission denied
slurmctld: error: Configured MailProg is invalid
slurmctld: Job accounting information stored, but details not gathered
slurmctld: fatal: Incorrect permissions on state save loc: /var/spool/slurm

I have the MailProg line in slurm.conf commented out, so does it have to be
specified to work? Also, since I'm root and root is the owner of the
/var/spool/slurm directory, I'm not sure why it's telling me the
permissions are incorrect...

On Thu, Mar 19, 2015 at 10:20 AM, Andy Riebs andy.ri...@hp.com wrote:

  Michael,

 Try running slurmctld -D which should result in output telling you
 what's going wrong.

 Andy



 On 03/19/2015 01:15 PM, Michael Kit Gilbert wrote:

 Sorry for the basic question, but I am new to slurm and am having some
 basic problems with plugins. What I'd like to do is make the
 job_submit_require_timelimit.so plugin that is found in the source code
 active and required for all jobs.

  What I've done so far is I've added the line

  *PluginDir=/usr/lib64/slurm*

  to slurm.conf and I've created a plugstack.conf file that has one line
 in it:

  *required        job_submit_require_timelimit.so*

  And now slurm won't start at all. So obviously I've made a huge newbie
 error. I've verified that our plugins are found in the /usr/lib64/slurm
 directory, but I can't tell what else I need to do.
 Does this plugin require arguments? Is there something else I'm missing?

  Thanks,

  Mike





[slurm-dev] Re: Newb question about plugins

2015-03-19 Thread Michael Kit Gilbert
Update: So, I have figured out the problem with slurm not running properly.
It had to do with my fstab file being incorrect and not mounting /var/spool
correctly.

Now I can start slurm correctly. However, when trying to run a job, slurm
doesn't load the plugin properly, so it fails with the following message:

sbatch: error: spank: /usr/lib64/slurm/job_submit_require_timelimit.so
exports 0 symbols
sbatch: error: spank: /etc/slurm/plugstack.conf:7: Failed to load plugin
/usr/lib64/slurm/job_submit_require_timelimit.so. Aborting.
sbatch: error: Failed to initialize plugin stack

I posted the slurm.conf and plugstack.conf changes I made in the first
post. Thanks for any help!

On Thu, Mar 19, 2015 at 11:55 AM, Michael Kit Gilbert m...@nau.edu wrote:

  Thank you so much for the reply, Andy. Well, apparently there's a lot
 happening that may be causing the issue. First, I can't seem to get
 slurmctld running properly. When I run slurmctld -D, this is my output:

 slurmctld: error: Can't save state, create file
 /var/spool/slurm/last_config_lite.new error Permission denied
 slurmctld: error: Configured MailProg is invalid
 slurmctld: Job accounting information stored, but details not gathered
 slurmctld: fatal: Incorrect permissions on state save loc: /var/spool/slurm

 I have the MailProg line in slurm.conf commented out, so does it have to
 be specified to work? Also, since I'm root and root is the owner of the
 /var/spool/slurm directory, I'm not sure why it's telling me the
 permissions are incorrect...

 On Thu, Mar 19, 2015 at 10:20 AM, Andy Riebs andy.ri...@hp.com wrote:

  Michael,

 Try running slurmctld -D which should result in output telling you
 what's going wrong.

 Andy



 On 03/19/2015 01:15 PM, Michael Kit Gilbert wrote:

 Sorry for the basic question, but I am new to slurm and am having some
 basic problems with plugins. What I'd like to do is make the
 job_submit_require_timelimit.so plugin that is found in the source code
 active and required for all jobs.

  What I've done so far is I've added the line

  *PluginDir=/usr/lib64/slurm*

  to slurm.conf and I've created a plugstack.conf file that has one line
 in it:

  *required        job_submit_require_timelimit.so*

  And now slurm won't start at all. So obviously I've made a huge newbie
 error. I've verified that our plugins are found in the /usr/lib64/slurm
 directory, but I can't tell what else I need to do.
 Does this plugin require arguments? Is there something else I'm missing?

  Thanks,

  Mike






[slurm-dev] Re: Newb question about plugins

2015-03-19 Thread Andy Riebs
   OK, we (or at least I) have reached the point where you need to
 provide some more information:
 * What operating system and version?
 * What Slurm version?
 * What compiler?
 
 You apparently have some kind of build problem, as Slurm plugins are
 required to export a specific set of symbols; they seem not to be
 exported in your plugin.
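
 One plausible reading of the "exports 0 symbols" message is that the file
 listed in plugstack.conf is not a SPANK plugin at all, so none of the SPANK
 entry points are found in it. The Python sketch below shows how that could be
 checked from the output of `nm -D --defined-only`. The entry-point lists are
 partial and assumed for illustration, the function names are hypothetical,
 and the sample nm output is made up.

```python
# Partial, assumed lists of entry points; consult the Slurm plugin
# documentation for the authoritative names.
SPANK_ENTRY_POINTS = {"slurm_spank_init", "slurm_spank_task_init", "slurm_spank_exit"}
JOB_SUBMIT_ENTRY_POINTS = {"job_submit", "job_modify"}


def defined_symbols(nm_output):
    """Collect symbol names from `nm -D --defined-only` style output."""
    names = set()
    for line in nm_output.splitlines():
        fields = line.split()
        if len(fields) == 3:  # address, type letter, symbol name
            names.add(fields[2])
    return names


def plugin_kind(nm_output):
    """Guess the plugin type from the entry points it defines."""
    names = defined_symbols(nm_output)
    if names & SPANK_ENTRY_POINTS:
        return "spank"
    if names & JOB_SUBMIT_ENTRY_POINTS:
        return "job_submit"
    return "unknown"


# Hypothetical nm output for a job_submit plugin (addresses are invented).
SAMPLE_NM = """\
0000000000000a10 T job_modify
00000000000009c0 T job_submit
0000000000200d80 D plugin_name
0000000000200d60 D plugin_type
"""
```

 If plugin_kind() reports "job_submit" for a file, it presumably belongs in
 slurm.conf rather than in plugstack.conf.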
 
 Have you gotten Slurm to run without the plugin? That's a useful
 first step before adding anything that is optional. (BTW, did you
 discover that MailProg is a requirement, once you get further down
 the road?)
 
 Andy
 
 On 03/19/2015 04:33 PM, Michael Kit Gilbert wrote:

 Update: So, I have figured out the problem with slurm not running
 properly. It had to do with my fstab file being incorrect and not mounting
 /var/spool correctly.

 Now I can start slurm correctly. However, when trying to run a job,
 slurm doesn't load the plugin properly, so it fails with the following
 message:

 sbatch: error: spank: /usr/lib64/slurm/job_submit_require_timelimit.so
 exports 0 symbols
 sbatch: error: spank: /etc/slurm/plugstack.conf:7: Failed to load plugin
 /usr/lib64/slurm/job_submit_require_timelimit.so. Aborting.
 sbatch: error: Failed to initialize plugin stack

 I posted the slurm.conf and plugstack.conf changes I made in the first
 post. Thanks for any help!

 On Thu, Mar 19, 2015 at 11:55 AM, Michael Kit Gilbert m...@nau.edu wrote:

 Thank you so much for the reply, Andy. Well, apparently there's a lot
 happening that may be causing the issue. First, I can't seem to get
 slurmctld running properly. When I run slurmctld -D, this is my output:

 slurmctld: error: Can't save state, create file
 /var/spool/slurm/last_config_lite.new error Permission denied
 slurmctld: error: Configured MailProg is invalid
 slurmctld: Job accounting information stored, but details not gathered
 slurmctld: fatal: Incorrect permissions on state save loc: /var/spool/slurm

 I have the MailProg line in slurm.conf commented out, so does it have to
 be specified to work? Also, since I'm root and root is the owner of the
 /var/spool/slurm directory, I'm not sure why it's telling me the
 permissions are incorrect...

 On Thu, Mar 19, 2015 at 10:20 AM, Andy Riebs andy.ri...@hp.com wrote:

 Michael,

 Try running slurmctld -D which should result in output telling you
 what's going wrong.

 Andy

 On 03/19/2015 01:15 PM, Michael Kit Gilbert wrote:

 Sorry for the basic question, but I am new to slurm and am having some
 basic problems with plugins. What I'd like to do is make the
 job_submit_require_timelimit.so plugin that is found in the source code
 active and required for all jobs.

 What I've done so far is I've added the line

 *PluginDir=/usr/lib64/slurm*

 to slurm.conf and I've created a plugstack.conf file that has one line
 in it:

 *required        job_submit_require_timelimit.so*

 And now slurm won't start at all. So obviously I've made a huge newbie
 error. I've verified that our plugins are found in the /usr/lib64/slurm
 directory, but I can't tell what else I need to do.
 Does this plugin require arguments? Is there something else I'm missing?

 Thanks,

 Mike


[slurm-dev] Slurm versions 14.11.5 and 15.08.0-pre3 are now available

2015-03-19 Thread Moe Jette


Version 14.11.5 contains quite a few bug fixes generated over the past  
five weeks including two high impact bugs. There is a fix for the  
slurmdbd daemon aborting if a node is set to a DOWN state and its
reason field is NULL. The other important bug fix will prevent
someone from being able to kill a job array belonging to another user.  
Details about all of the changes are appended.


Version 15.08.0-pre3 represents the current state of Slurm development  
for the release planned in August 2015 and is intended for development  
and test purposes only. Notable enhancements include power capping
support for Cray systems and the ability for a compute node to be
allocated to multiple jobs, but restricted to one user at a time.


Both versions can be downloaded from
http://www.schedmd.com/#repos


* Changes in Slurm 14.11.5
==========================
 -- Correct the squeue command taking into account that a node can
have NULL name if it is not in DNS but still in slurm.conf.
 -- Fix slurmdbd regression which would cause a segfault when a node is set
down with no reason.
 -- BGQ - Fix issue with job arrays not being handled correctly
in the runjob_mux plugin.
 -- Print FAIR_TREE, if configured, in scontrol show config output for
PriorityFlags.
 -- Add SLURM_JOB_GPUS environment variable to those available in the Prolog.
 -- Load lua-5.2 library if using lua5.2 for lua job submit plugin.
 -- GRES logic: Prevent bad node_offset due to not preserving no_consume flag.
 -- Fix wrong variables used in the wrapper functions needed for systems that
don't support strong_alias
 -- Fix code for Apple computers, where SOL_TCP is not defined.
 -- Cray/BASIL - Check for mysql credentials in /root/.my.cnf.
 -- Fix sprio showing wrong priority for job arrays until priority is
recalculated.
 -- Account to batch step all CPUs that are allocated to a job not
just one since the batch step has access to all CPUs like other steps.
 -- Fix job getting EligibleTime set before meeting dependency requirements.
 -- Correct the initialization of QOS MinCPUs per job limit.
 -- Set the debug level of information messages in cgroup plugin to debug2.
 -- For job running under a debugger, if the exec of the task fails, then
cancel its I/O and abort immediately rather than waiting 60 seconds for
I/O timeout.
 -- Fix associations not getting default qos set until after a restart.
 -- Set the value of total_cpus not to be zero before invoking
acct_policy_job_runnable_post_select.
 -- MySQL - When requesting cluster resources, only return resources for the
cluster(s) requested.
 -- Add TaskPluginParam=autobind=threads option to set a default binding in the
case that auto binding doesn't find a match.
 -- Introduce a new SchedulerParameters variable nohold_on_prolog_fail.
If configured, don't requeue jobs on hold if a Prolog fails.
 -- Make it so sched_params isn't read over and over when an epilog complete
message comes in
 -- Fix squeue -L licenses not filtering out jobs with licenses.
 -- Changed the implementation of xcpuinfo_abs_to_mac() to be identical to
_abs_to_mac() to fix CPUs allocation using cpuset cgroup.
 -- Improve the explanation of the unbuffered feature in the
srun man page.
 -- Make taskplugin=cgroup work for core spec.  needed to have task/cgroup
before.
 -- Fix reports not using the month usage table.
 -- BGQ - Sanity check given for translating small blocks into slurm
bg_records.
 -- Fix bug preventing the requeue/hold or requeue/special_exit of job from the
completing state.
 -- Cray - Fix for launching batch step within an existing job allocation.
 -- Cray - Add ALPS_APP_ID_ENV environment variable.
 -- Increase maximum MaxArraySize configuration parameter value from 1,000,001
to 4,000,001.
 -- Added new SchedulerParameters value of bf_min_age_reserve. The backfill
scheduler will not reserve resources for pending jobs until they have
been pending for at least the specified number of seconds. This can be
valuable if jobs lack time limits or all time limits have the same value.
 -- Fix support for --mem=0 (all memory of a node) with select/cons_res plugin.
 -- Fix bug that can permit someone to kill job array belonging to another user.
 -- Don't set the default partition on a license only reservation.
 -- Show a NodeCnt=0, instead of NO_VAL, in scontrol show res for a license
only reservation.
 -- BGQ - When using static small blocks make sure when clearing the job the
block is set up to its original state.
 -- Start job allocation using lowest numbered sockets for block task
distribution for consistency with cyclic distribution.


* Changes in Slurm 15.08.0pre3
=============================
-- CRAY - addition of acct_gather_energy/cray plugin.
-- Add job credential to Run Prolog RPC used with a configuration of
   PrologFlags=alloc. This allows the Prolog to be passed identification of
   GPUs allocated to 

[slurm-dev] Re: Newb question about plugins

2015-03-19 Thread Michael Kit Gilbert
Thanks again for the help!
OS: CentOS 6.5
Slurm version: 14.11.2
Compiler: gcc 4.4.7

Getting rid of the plugstack.conf file allowed me to start running jobs
again, but the plugin that I'm wanting to work doesn't appear to be enabled.

There are a bunch of *.so plugin files in the /usr/lib64/slurm directory.
One of them is job_submit_require_timelimit.so. I assume that since these
are installed here that they were compiled when slurm was installed. So
since they're in this directory, how do I enable them? I want people to be
forced to enter a time limit and that is what this plugin appears to do.

On Thu, Mar 19, 2015 at 1:46 PM, Andy Riebs andy.ri...@hp.com wrote:

  OK, we (or at least I) have reached the point where you need to provide
 some more information:
 * What operating system and version?
 * What Slurm version?
 * What compiler?

 You apparently have some kind of build problem, as Slurm plugins are
 required to export a specific set of symbols; they seem not to be exported
 in your plugin.

 Have you gotten Slurm to run without the plugin? That's a useful first
 step before adding anything that is optional. (BTW, did you discover that
 MailProg is a requirement, once you get further down the road?)

 Andy


 On 03/19/2015 04:33 PM, Michael Kit Gilbert wrote:

 Update: So, I have figured out the problem with slurm not running
 properly. It had to do with my fstab file being incorrect and not mounting
 /var/spool correctly.

  Now I can start slurm correctly. However, when trying to run a job,
 slurm doesn't load the plugin properly, so it fails with the following
 message:

  sbatch: error: spank: /usr/lib64/slurm/job_submit_require_timelimit.so
 exports 0 symbols
 sbatch: error: spank: /etc/slurm/plugstack.conf:7: Failed to load plugin
 /usr/lib64/slurm/job_submit_require_timelimit.so. Aborting.
 sbatch: error: Failed to initialize plugin stack

  I posted the slurm.conf and plugstack.conf changes I made in the first
 post. Thanks for any help!

 On Thu, Mar 19, 2015 at 11:55 AM, Michael Kit Gilbert m...@nau.edu
 wrote:

  Thank you so much for the reply, Andy. Well, apparently there's a lot
 happening that may be causing the issue. First, I can't seem to get
 slurmctld running properly. When I run slurmctld -D, this is my output:

  slurmctld: error: Can't save state, create file
 /var/spool/slurm/last_config_lite.new error Permission denied
 slurmctld: error: Configured MailProg is invalid
 slurmctld: Job accounting information stored, but details not gathered
 slurmctld: fatal: Incorrect permissions on state save loc:
 /var/spool/slurm

  I have the MailProg line in slurm.conf commented out, so does it have
 to be specified to work? Also, since I'm root and root is the owner of the
 /var/spool/slurm directory, I'm not sure why it's telling me the
 permissions are incorrect...

 On Thu, Mar 19, 2015 at 10:20 AM, Andy Riebs andy.ri...@hp.com wrote:

  Michael,

 Try running slurmctld -D which should result in output telling you
 what's going wrong.

 Andy



 On 03/19/2015 01:15 PM, Michael Kit Gilbert wrote:

 Sorry for the basic question, but I am new to slurm and am having some
 basic problems with plugins. What I'd like to do is make the
 job_submit_require_timelimit.so plugin that is found in the source code
 active and required for all jobs.

  What I've done so far is I've added the line

  *PluginDir=/usr/lib64/slurm*

  to slurm.conf and I've created a plugstack.conf file that has one line
 in it:

  *required        job_submit_require_timelimit.so*

  And now slurm won't start at all. So obviously I've made a huge newbie
 error. I've verified that our plugins are found in the /usr/lib64/slurm
 directory, but I can't tell what else I need to do.
 Does this plugin require arguments? Is there something else I'm missing?

  Thanks,

  Mike