Nicholas,

I can confirm that I get the same result as you, and I now realize my
mistake. I executed
sbatch ls_test -N 2
rather than
sbatch -N 2 ls_test

Which is a rather silly thing to get tripped up over.
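
For anyone who finds this thread later: sbatch treats everything after the
script name as arguments to the script, so in the first form "-N 2" never
reaches Slurm and min_nodes stays unset. The 4294967294 reported by the
plugin appears to be Slurm's NO_VAL sentinel (0xfffffffe in slurm.h) rather
than garbage. Below is a minimal sketch of a check for it. The NO_VAL
constant and the describe_min_nodes helper are illustrative names, not part
of the Slurm API, and the stub slurm table exists only so the snippet runs
outside slurmctld.

```lua
-- Assumption: 4294967294 (0xfffffffe) is the value Slurm stores in 32-bit
-- job_desc fields that were not set. NO_VAL is a local, illustrative name.
local NO_VAL = 4294967294

-- Stub so the sketch runs in a plain Lua interpreter; slurmctld supplies
-- the real slurm table when it loads job_submit.lua.
local slurm = slurm or {
  SUCCESS = 0,
  ERROR = -1,
  log_info = function(fmt, ...) print(string.format(fmt, ...)) end,
}

-- Hypothetical helper: describe min_nodes, treating NO_VAL as "unset".
function describe_min_nodes(job_desc)
  local n = job_desc.min_nodes
  if n == nil or n == NO_VAL then
    return "min_nodes not specified"
  end
  return string.format("min_nodes requested: %d", n)
end

function slurm_job_submit(job_desc, part_list, submit_uid)
  slurm.log_info("%s", describe_min_nodes(job_desc))
  return slurm.SUCCESS
end

-- Usage outside Slurm, with plain tables standing in for job_desc:
slurm_job_submit({ min_nodes = NO_VAL }, nil, 1000)  -- like: sbatch ls_test -N 2
slurm_job_submit({ min_nodes = 2 }, nil, 1000)       -- like: sbatch -N 2 ls_test
```

A plugin that wants to apply a default node count could use the same test
and assign job_desc.min_nodes only when the field is unset.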

Thanks,
Nathan

On 28 June 2017 at 16:26, Nicholas McCollum <[email protected]> wrote:

> Try this:
>
> [root@dmc197 ~]# sbatch ls_test
> sbatch: error:
> Job requested Min Nodes: 4294967294.
>
> sbatch: error: Batch job submission failed: Unspecified error
> [root@dmc197 ~]# sbatch -N 2 ls_test
> sbatch: error:
> Job requested Min Nodes: 2.
>
> sbatch: error: Batch job submission failed: Unspecified error
> [root@dmc197 ~]# cat /etc/slurm/job_submit.lua
> function slurm_job_modify(job_desc, part_list, submit_uid)
> end
>
> function slurm_job_submit(job_desc, part_list, submit_uid)
>   local test_min_nodes = job_desc.min_nodes
>   error_verbose = string.format("Job requested Min Nodes: %s.\n", test_min_nodes)
>   slurm.log_user("\n%s", error_verbose)
>   return slurm.ERROR
> end
>
>
> --
> Nicholas McCollum
> HPC Systems Administrator
> Alabama Supercomputer Authority
>
>
>
> On Wed, 2017-06-28 at 13:51 -0600, Nathan Vance wrote:
> > Correction (copy/pasted wrong thing): It was the
> > "JobSubmitPlugins=lua" line in slurm.conf, not "job_submit.lua:
> > initialized", that did the trick.
> >
> > At least, I thought that was the end of the story. Now I'm getting
> > odd errors when reading job_desc and part_list that behave, in my
> > estimation, as if lua is receiving a bad pointer to the underlying C
> > data structure.
> >
> > On ubuntu, the unedited job_submit.lua provided with the sample code
> > runs without crashing, though it does not respect the
> > --partition="foo" flag in sbatch as the source code suggests it should.
> > When edited to include slurm.log_info("bar"), the script crashes
> > with:
> > /etc/slurm/job_submit.lua:38: attempt to compare number with nil
> > The fact that behaviour changes based on the presence of unrelated
> > code makes me think that this is a pointer issue, but I don't know
> > enough about the compilation of lua to bytecode to diagnose it.
> >
> > On centos, with or without the log command, it crashes at the same
> > point as on ubuntu.
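
Whatever the underlying cause turns out to be, "attempt to compare number
with nil" is the error plain Lua raises when one side of a < or > comparison
is nil, so a field that comes back nil would be enough to trigger it. A
minimal sketch of a guarded comparison follows. The exceeds helper is a
hypothetical name, and nothing here is taken from line 38 of the sample
script.

```lua
-- Hypothetical helper: compare only when the value is actually a number,
-- so a nil field yields false instead of raising an error.
function exceeds(value, limit)
  return type(value) == "number" and value > limit
end

-- An unguarded comparison against nil raises a runtime error, which
-- pcall catches here so the sketch keeps running.
local missing = nil
local ok, err = pcall(function() return missing > 10 end)
print(ok, err)

print(exceeds(missing, 10))  -- guarded: no error for a nil value
print(exceeds(12, 10))
```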
> >
> > On both:
> > When I comment out the example code so that it doesn't crash, then
> > try to print out values in job_desc, I get some really odd results.
> > For example, job_desc.min_nodes is 4294967294 (on both systems),
> > regardless of what I set with sbatch job.sh --nodes=X. At first I
> > thought that slurm gave my lua script a bad pointer to something that
> > had already been garbage collected, but then I discovered that if I
> > hard code something in lua such as job_desc.min_nodes=X, then slurm
> > assigns X nodes to the job. So perhaps slurm respects what lua
> > populates job_desc with, but slurm initially fills it with arbitrary
> > values?
> >
> > Here's the lua script I used for the above experiments:
> > ======== BEGIN job_submit.lua ========
> > function slurm_job_submit(job_desc, part_list, submit_uid)
> >     slurm.log_info(job_desc.min_nodes)
> >     job_desc.min_nodes=5
> >     return slurm.SUCCESS
> > end
> >
> > function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
> >     return slurm.SUCCESS
> > end
> >
> > slurm.log_info("initialized")
> > return slurm.SUCCESS
> > ======== END job_submit.lua ========
> >
> > As an aside, it looks like job_desc uses job_descriptor under the
> > hood:
> > https://github.com/SchedMD/slurm/blob/master/slurm/slurm.h.in#L1373-L1553
> > As I wasn't positive, I experimented first using job_desc.qos, which
> > Nicholas indicated should be supported, but while it exhibited
> > similar behaviour to min_nodes, it didn't fail quite as
> > spectacularly.
> > I couldn't figure out what structure backs part_list. The
> > documentation at https://slurm.schedmd.com/job_submit_plugins.html
> > isn't clear when all it says is that it's a "List of pointer to
> > partitions which this user is authorized to use." [sic]
> >
> > I'm still using slurm 17.02.5. On ubuntu I'm using lua5.2, and on
> > centos it's lua5.1. In both cases, lua (both the interpreter and the
> > dev libraries) was installed from the repositories, and slurm was
> > built from source.
> >
> > It seems like I filled an email with a whole lot of complaints and no
> > real questions. So, is this a configuration error on my end? Should I
> > suck it up and write my plugin in c, even though I don't need full
> > access to slurmctld? Should I switch to using slurm-wlm? Should I
> > open a bug report?
> >
> > Thanks,
> > Nathan
> >
> > On 27 June 2017 at 17:07, Nicholas McCollum <[email protected]>
> > wrote:
> > > Nathan,
> > >
> > > I have very much appreciated the job_submit.lua plugin for helping
> > > educate users on what is an acceptable job.  It is one of my favorite
> > > features about SLURM and has been invaluable in assisting students in
> > > submitting valid job requirements.
> > >
> > > If a user specifies some absurd amount of memory, or some other sbatch
> > > or srun parameter... or does not choose a parameter, I like to notify
> > > the user what they have done wrong.  For example I require all users to
> > > specify a QoS when they submit a job.
> > >
> > > ====== BEGIN EXAMPLE job_submit.lua ======
> > >
> > > function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
> > >   return slurm.SUCCESS
> > > end
> > >
> > > function slurm_job_submit(job_desc, part_list, submit_uid)
> > >
> > >   --[[ Start with an error count of 0 ]]--
> > >   local asc_error = 0
> > >   local asc_error_verbose = ""
> > >
> > >   --[[ Pretend if statement ]]--
> > >     asc_error = asc_error + 1
> > >     asc_error_verbose = string.format("%s\nERROR: Job requested something we don't like.\n", asc_error_verbose)
> > >   --[[ End Pretend if statement ]]--
> > >
> > >   --[[ Pretend if statement ]]--
> > >     asc_error = asc_error + 1
> > >     asc_error_verbose = string.format("%s\nERROR: More bad stuff.\n", asc_error_verbose)
> > >   --[[ End Pretend if statement ]]--
> > >
> > >   if asc_error > 0 then
> > >     slurm.log_user("\n%s", asc_error_verbose)
> > >     return slurm.ERROR
> > >   end
> > >
> > >   --[[ Want to return slurm.SUCCESS if the entire script runs to end ]]--
> > >   return slurm.SUCCESS
> > > end
> > >
> > > ====== END EXAMPLE job_submit.lua =======
> > >
> > > This is the method that I worked out, where it collects all of the
> > > errors inside asc_error_verbose and dumps them out at the end with
> > > return slurm.ERROR.  If you use the file above as is, it will reject
> > > every submitted job with the errors above.  This would be a great way
> > > to check that job_submit.lua is working on your system.  If you have
> > > any current jobs though, it will kill them all... so use this on a
> > > development environment for testing.
> > >
> > > My example for making a user specify a QoS:
> > >
> > >   local asc_qos = job_desc.qos
> > >   if asc_qos == nil then
> > >     asc_error = asc_error + 1
> > >     asc_error_verbose = string.format("%s\nJob must request a QoS using the --qos= flag.\n", asc_error_verbose)
> > >     asc_qos = "invalid"
> > >   end
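
The error-collecting pattern and the QoS check above can be combined into
one runnable file. The following is a minimal sketch, not the original
author's production script. It gathers messages in a table rather than
growing a string, collect_errors is a hypothetical helper name, and the stub
slurm table exists only so the file can be exercised in a plain Lua
interpreter before it goes anywhere near slurmctld.

```lua
-- Stub so the sketch runs outside Slurm; slurmctld supplies the real
-- slurm table when it loads job_submit.lua.
local slurm = slurm or {
  SUCCESS = 0,
  ERROR = -1,
  log_user = function(fmt, ...) print(string.format(fmt, ...)) end,
}

-- Hypothetical helper: return a list of problems found in job_desc.
function collect_errors(job_desc)
  local errors = {}
  if job_desc.qos == nil then
    errors[#errors + 1] = "Job must request a QoS using the --qos= flag."
  end
  return errors
end

function slurm_job_submit(job_desc, part_list, submit_uid)
  local errors = collect_errors(job_desc)
  if #errors > 0 then
    -- Report every problem at once, then reject the job.
    slurm.log_user("\n%s\n", table.concat(errors, "\n"))
    return slurm.ERROR
  end
  return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
  return slurm.SUCCESS
end
```

Calling slurm_job_submit({}, nil, 1000) from an interactive Lua session
prints the QoS message and returns the stub's ERROR value, which is a quick
way to check the logic before restarting slurmctld.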
> > >
> > >
> > > I'd be more than happy to share my job_submit.lua if anyone is
> > > interested.  I only ask that you share yours back.
> > >
> > > --
> > > Nicholas McCollum
> > > HPC Systems Administrator
> > > Alabama Supercomputer Authority
> > >
> > > On Tue, 2017-06-27 at 14:30 -0600, Nathan Vance wrote:
> > > > Darby,
> > > >
> > > > The "job_submit.lua: initialized" line in slurm.conf was indeed the
> > > > issue. When compiling slurm I only got the "yes lua" line without the
> > > > flags, but that seems to be just a difference in OS's.
> > > >
> > > > Now that I have debugging feedback I should be good to go!
> > > >
> > > > Thanks,
> > > > Nathan
> > > >
> > > > On 27 June 2017 at 16:13, Vicker, Darby (JSC-EG311) <darby.vicker
> > > -1@n
> > > > asa.gov> wrote:
> > > > > We recently started using a lua job submit plugin as well.  You
> > > > > have to have the lua-devel package installed when you compile
> > > > > slurm.  It looks like you do (we use RHEL, where the package name
> > > > > is lua-devel), but confirm that you see something like these in
> > > > > config.log:
> > > > >
> > > > > configure:24784: result: yes lua
> > > > > pkg_cv_lua_LIBS='-llua -lm -ldl  '
> > > > > lua_CFLAGS='  -DLUA_COMPAT_ALL'
> > > > > lua_LIBS='-llua -lm -ldl  '
> > > > >
> > > > > Do you have this in your slurm.conf?
> > > > >
> > > > > JobSubmitPlugins=lua
> > > > >
> > > > > I'm guessing not given you don't see anything in the logs.  Before I
> > > > > got all the errors worked out, I would see errors like this in
> > > > > slurmctld_log:
> > > > >
> > > > > error: Couldn't find the specified plugin name for job_submit/lua looking at all files
> > > > > error: cannot find job_submit plugin for job_submit/lua
> > > > > error: cannot create job_submit context for job_submit/lua
> > > > > failed to initialize job_submit plugin
> > > > >
> > > > >
> > > > > After getting everything working, you should see this:
> > > > >
> > > > > job_submit.lua: initialized
> > > > >
> > > > > As well as any other slurm.log_info messages you put in your lua
> > > > > script.
> > > > >
> > > > >
> > > > > From: Nathan Vance <[email protected]>
> > > > > Reply-To: slurm-dev <[email protected]>
> > > > > Date: Tuesday, June 27, 2017 at 12:15 PM
> > > > > To: slurm-dev <[email protected]>
> > > > > Subject: [slurm-dev] Job Submit Lua Plugin
> > > > >
> > > > > Hello all!
> > > > >
> > > > > I've been working on getting off the ground with Lua plugins. The
> > > > > goal is to implement Torque's routing queues for SLURM, but so far
> > > > > I have been unable to get SLURM to even call my plugin.
> > > > >
> > > > > What I have tried:
> > > > > 1) Copied contrib/lua/job_submit.lua to /etc/slurm/ (the same
> > > > > directory as slurm.conf)
> > > > > 2) Restarted slurmctld and verified that no functionality was
> > > > > broken
> > > > > 3) Added slurm.log_info("I got here") to several points in the
> > > > > script. After restarting slurmctld and submitting a job, grep "I
> > > > > got here" -R /var/log found no results.
> > > > > 4) In case there was a problem with the log file, I added
> > > > > os.execute("touch /home/myUser/slurm_job_submitted") to the top of
> > > > > the slurm_job_submit method. Restarting slurmctld and submitting a
> > > > > job still produced no evidence that my plugin was called.
> > > > > 5) In case there were permission issues, I made job_submit.lua
> > > > > executable. Nothing. Even grep "job_submit" -R /var/log (in case
> > > > > there was an error calling the script) comes up dry.
> > > > >
> > > > > Relevant information:
> > > > > OS: Ubuntu 16.04
> > > > > Lua: lua5.2 and liblua5.2-dev (I can use Lua interactively)
> > > > > SLURM version: 17.02.5, compiled from source (after installing Lua)
> > > > > using ./configure --prefix=/usr --sysconfdir=/etc/slurm
> > > > >
> > > > > Any guidance to get me up and running would be greatly appreciated!
> > > > >
> > > > > Thanks,
> > > > > Nathan
> > > >
> > > >
> >
> >
>
