Nicholas,

I can confirm that I get the same result as you, and I now realize my mistake. I executed

    sbatch ls_test -N 2

rather than

    sbatch -N 2 ls_test

which is a rather silly thing to get tripped up over.

Thanks,
Nathan

On 28 June 2017 at 16:26, Nicholas McCollum wrote:
> Try this:
>
> [root@dmc197 ~]# sbatch ls_test
> sbatch: error:
> Job requested Min Nodes: 4294967294.
>
> sbatch: error: Batch job submission failed: Unspecified error
> [root@dmc197 ~]# sbatch -N 2 ls_test
> sbatch: error:
> Job requested Min Nodes: 2.
>
> sbatch: error: Batch job submission failed: Unspecified error
> [root@dmc197 ~]# cat /etc/slurm/job_submit.lua
> function slurm_job_modify(job_desc, part_list, submit_uid)
> end
>
> function slurm_job_submit(job_desc, part_list, submit_uid)
>     local test_min_nodes = job_desc.min_nodes
>     error_verbose = string.format("Job requested Min Nodes: %s.\n", test_min_nodes)
>     slurm.log_user("\n%s", error_verbose)
>     return slurm.ERROR
> end
>
> --
> Nicholas McCollum
> HPC Systems Administrator
> Alabama Supercomputer Authority
>
> On Wed, 2017-06-28 at 13:51 -0600, Nathan Vance wrote:
> > Correction (copy/pasted wrong thing): It was the
> > "JobSubmitPlugins=lua" line in slurm.conf, not "job_submit.lua:
> > initialized", that did the trick.
> >
> > At least, I thought that was the end of the story. Now I'm getting
> > odd errors with reading job_desc and part_list that behave, in my
> > estimate, like lua's receiving a bad pointer to the underlying c data
> > structure.
> >
> > On ubuntu, the unedited job_submit.lua provided with the sample code
> > runs without crashing, though it does not respect the
> > --partition="foo" flag in sbatch as the source code suggests it should.
> > When edited to include slurm.log_info("bar"), the script crashes
> > with:
> > /etc/slurm/job_submit.lua:38: attempt to compare number with nil
> > The fact that behaviour changes based on the presence of unrelated
> > code makes me think that this is a pointer issue, but I don't know
> > enough about the compilation of lua to bytecode to diagnose it.
> >
> > On centos, with or without the log command, it crashes at the same
> > point as on ubuntu.
> >
> > On both:
> > When I comment out the example code so that it doesn't crash, then
> > try to print out values in job_desc, I get some really odd results.
> > For example, job_desc.min_nodes is 4294967294 (on both systems),
> > regardless of what I set with sbatch job.sh --nodes=X. At first I
> > thought that slurm gave my lua script a bad pointer to something that
> > had already been garbage collected, but then I discovered that if I
> > hard code something in lua such as job_desc.min_nodes=X, then slurm
> > assigns X nodes to the job. So perhaps slurm respects what lua
> > populates job_desc with, but slurm initially fills it with arbitrary
> > values?
> >
> > Here's the lua script I used for the above experiments:
> > ======== BEGIN job_submit.lua ========
> > function slurm_job_submit(job_desc, part_list, submit_uid)
> >     slurm.log_info(job_desc.min_nodes)
> >     job_desc.min_nodes=5
> >     return slurm.SUCCESS
> > end
> >
> > function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
> >     return slurm.SUCCESS
> > end
> >
> > slurm.log_info("initialized")
> > return slurm.SUCCESS
> > ======== END job_submit.lua ========
> >
> > As an aside, it looks like job_desc uses job_descriptor under the
> > hood:
> > https://github.com/SchedMD/slurm/blob/master/slurm/slurm.h.in#L1373-L1553
> > As I wasn't positive, I experimented first using job_desc.qos, which
> > Nicholas indicated should be supported, but while it exhibited
> > similar behaviour to min_nodes, it didn't fail quite as
> > spectacularly.
> > I couldn't figure out what structure backs part_list. The
> > documentation at https://slurm.schedmd.com/job_submit_plugins.html
> > isn't clear when all it says is that it's a "List of pointer to
> > partitions which this user is authorized to use." [sic]
> >
> > I'm still using slurm 17.02.5. On ubuntu I'm using lua5.2, and on
> > centos it's lua5.1.
> > In both cases, lua (both the interpreter and the dev libraries) were
> > installed from the repositories, and slurm was built from source.
> >
> > It seems like I filled an email with a whole lot of complaints and no
> > real questions. So, is this a configuration error on my end? Should I
> > suck it up and write my plugin in c, even though I don't need full
> > access to slurmctld? Should I switch to using slurm-wlm? Should I
> > open a bug report?
> >
> > Thanks,
> > Nathan
> >
> > On 27 June 2017 at 17:07, Nicholas McCollum wrote:
> > > Nathan,
> > >
> > > I have very much appreciated the job_submit.lua plugin for helping
> > > educate users on what is an acceptable job. It is one of my favorite
> > > features about SLURM and has been invaluable in assisting students in
> > > submitting valid job requirements.
> > >
> > > If a user specifies some absurd amount of memory, or some other sbatch
> > > or srun parameter... or does not choose a parameter, I like to notify
> > > the user what they have done wrong. For example I require all users to
> > > specify a QoS when they submit a job.
> > >
> > > ====== BEGIN EXAMPLE job_submit.lua ======
> > >
> > > function slurm_job_modify(job_desc, part_list, submit_uid)
> > > end
> > >
> > > function slurm_job_submit(job_desc, part_list, submit_uid)
> > >
> > >     --[[ Start with an error count of 0 ]]--
> > >     local asc_error = 0
> > >     local asc_error_verbose = ""
> > >
> > >     --[[ Pretend if statement ]]--
> > >     asc_error = asc_error + 1
> > >     asc_error_verbose = string.format("%s\nERROR: Job requested something we dont like.\n", asc_error_verbose)
> > >     --[[ End Pretend if statement ]]--
> > >
> > >     --[[ Pretend if statement ]]--
> > >     asc_error = asc_error + 1
> > >     asc_error_verbose = string.format("%s\nERROR: More bad stuff.\n", asc_error_verbose)
> > >     --[[ End Pretend if statement ]]--
> > >
> > >     if asc_error > 0 then
> > >         slurm.log_user("\n%s", asc_error_verbose)
> > >         return slurm.ERROR
> > >     end
> > >
> > >     --[[ Want to return slurm.SUCCESS if the entire script runs to end ]]--
> > >     return slurm.SUCCESS
> > > end
> > >
> > > ====== END EXAMPLE job_submit.lua =======
> > >
> > > This is the method that I worked out, where it collects all of the
> > > errors inside asc_error_verbose and dumps out at the end with return
> > > slurm.ERROR. If you use the current file above, it will return every
> > > job with those errors above. This would be a great way to check that
> > > job_submit.lua is working on your system. If you have any current jobs
> > > though, it will kill them all... so use this on a development
> > > environment for testing.
> > >
> > > My example for making a user specify a QoS:
> > >
> > > local asc_qos = job_desc.qos
> > > if asc_qos == nil then
> > >     asc_error = asc_error + 1
> > >     asc_error_verbose = string.format("%s\nJob must request a QoS using the --qos= flag.\n",asc_error_verbose)
> > >     asc_qos = "invalid"
> > > end
> > >
> > >
> > > I'd be more than happy to share my job_submit.lua if anyone is
> > > interested. I only ask that you share yours back.
> > >
> > > --
> > > Nicholas McCollum
> > > HPC Systems Administrator
> > > Alabama Supercomputer Authority
> > >
> > > On Tue, 2017-06-27 at 14:30 -0600, Nathan Vance wrote:
> > > > Darby,
> > > >
> > > > The "job_submit.lua: initialized" line in slurm.conf was indeed the
> > > > issue. When compiling slurm I only got the "yes lua" line without the
> > > > flags, but that seems to be just a difference in OS's.
> > > >
> > > > Now that I have debugging feedback I should be good to go!
> > > >
> > > > Thanks,
> > > > Nathan
> > > >
> > > > On 27 June 2017 at 16:13, Vicker, Darby (JSC-EG311)
> > > > <darby.vicker-1@nasa.gov> wrote:
> > > > > We recently started using a lua job submit plugin as well. You
> > > > > have to have the lua-devel package installed when you compile
> > > > > slurm. It looks like you do (we use RHEL, where the package name
> > > > > is lua-devel), but confirm that you see something like these in
> > > > > config.log:
> > > > >
> > > > > configure:24784: result: yes lua
> > > > > pkg_cv_lua_LIBS='-llua -lm -ldl '
> > > > > lua_CFLAGS=' -DLUA_COMPAT_ALL'
> > > > > lua_LIBS='-llua -lm -ldl '
> > > > >
> > > > > Do you have this in your slurm.conf?
> > > > >
> > > > > JobSubmitPlugins=lua
> > > > >
> > > > > I'm guessing not given you don't see anything in the logs.
> > > > > Before I got all the errors worked out, I would see errors like
> > > > > this in slurmctld_log:
> > > > >
> > > > > error: Couldn't find the specified plugin name for job_submit/lua looking at all files
> > > > > error: cannot find job_submit plugin for job_submit/lua
> > > > > error: cannot create job_submit context for job_submit/lua
> > > > > failed to initialize job_submit plugin
> > > > >
> > > > >
> > > > > After getting everything working, you should see this:
> > > > >
> > > > > job_submit.lua: initialized
> > > > >
> > > > > As well as any other slurm.log_info messages you put in your lua
> > > > > script.
> > > > >
> > > > >
> > > > > From: Nathan Vance
> > > > > Reply-To: slurm-dev
> > > > > Date: Tuesday, June 27, 2017 at 12:15 PM
> > > > > To: slurm-dev
> > > > > Subject: [slurm-dev] Job Submit Lua Plugin
> > > > >
> > > > > Hello all!
> > > > >
> > > > > I've been working on getting off the ground with Lua plugins. The
> > > > > goal is to implement Torque's routing queues for SLURM, but so far
> > > > > I have been unable to get SLURM to even call my plugin.
> > > > >
> > > > > What I have tried:
> > > > > 1) Copied contrib/lua/job_submit.lua to /etc/slurm/ (the same
> > > > > directory as slurm.conf)
> > > > > 2) Restarted slurmctld and verified that no functionality was
> > > > > broken
> > > > > 3) Added slurm.log_info("I got here") to several points in the
> > > > > script. After restarting slurmctld and submitting a job, grep "I
> > > > > got here" -R /var/log found no results.
> > > > > 4) In case there was a problem with the log file, I added
> > > > > os.execute("touch /home/myUser/slurm_job_submitted") to the top of
> > > > > the slurm_job_submit method. Restarting slurmctld and submitting a
> > > > > job still produced no evidence that my plugin was called.
> > > > > 5) In case there were permission issues, I made job_submit.lua
> > > > > executable. Nothing. Even grep "job_submit" -R /var/log (in case
> > > > > there was an error calling the script) comes up dry.
> > > > >
> > > > > Relevant information:
> > > > > OS: Ubuntu 16.04
> > > > > Lua: lua5.2 and liblua5.2-dev (I can use Lua interactively)
> > > > > SLURM version: 17.02.5, compiled from source (after installing Lua)
> > > > > using ./configure --prefix=/usr --sysconfdir=/etc/slurm
> > > > >
> > > > > Any guidance to get me up and running would be greatly appreciated!
> > > > >
> > > > > Thanks,
> > > > > Nathan
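
For anyone who finds this thread later: the 4294967294 reported for job_desc.min_nodes matches the NO_VAL sentinel (0xfffffffe) that slurm.h uses for unset 32-bit fields, and sbatch passes everything after the script name to the script itself, so "sbatch ls_test -N 2" leaves the node count unset. Below is a minimal sketch of a job_submit.lua that treats that value as "not specified" instead of a real node count. The local NO_VAL constant, the is_unset helper, and the stand-in slurm table are assumptions made for illustration; they are not taken from the scripts quoted above.

```lua
-- Sketch only. NO_VAL mirrors the sentinel from slurm.h (0xfffffffe); it is
-- defined locally here as an assumption, rather than relying on the plugin
-- exporting it.
local NO_VAL = 4294967294

-- Stand-in for the table slurmctld provides, so the sketch also runs under a
-- plain Lua interpreter. Inside slurmctld the real `slurm` table is kept.
slurm = slurm or {
    SUCCESS = 0,
    ERROR = -1,
    log_user = function(fmt, ...) print(string.format(fmt, ...)) end,
}

-- Hypothetical helper: true when a numeric field was never set by the user.
local function is_unset(value)
    return value == nil or value == NO_VAL
end

function slurm_job_submit(job_desc, part_list, submit_uid)
    if is_unset(job_desc.min_nodes) then
        -- Reject with a hint instead of echoing the raw sentinel value.
        slurm.log_user("No node count given; did -N come after the script name?")
        return slurm.ERROR
    end
    slurm.log_user("Job requested Min Nodes: %s.", tostring(job_desc.min_nodes))
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
```

With this in place, "sbatch -N 2 ls_test" would reach the second branch and "sbatch ls_test -N 2" the first. Rejecting the job outright is only for demonstration; a production script would more likely let unset values through and check them only when a policy depends on them.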
