I'm trying to write a Lua job_submit plugin, based on the example in the
Slurm source, that assigns jobs to a QOS based on that QOS's currently
allocated resources.

So right now we have the following partitions:

PartitionName=serial Nodes=c[0101-0104] Priority=100 AllowQOS=hepx,idhmc,general,aglife MaxNodes=1 MaxTime=120:00:00 PreemptMode=OFF State=UP
PartitionName=mpi_core8 Nodes=c[0925-0926]n[1-2] Priority=100 AllowQOS=mpi MinNodes=2 MaxTime=48:00:00 PreemptMode=OFF State=UP
PartitionName=mpi_core32 Nodes=c[0133-0134],c[0237-0238],c[0934-0936] Priority=100 AllowQOS=mpi MinNodes=2 MaxTime=48:00:00 PreemptMode=OFF State=UP
PartitionName=background Priority=10 AllowQOS=background,grid MaxTime=96:00:00 State=UP

We use partition preemption, which is why the "background" partition exists.

We would like users not to have to choose a QOS; instead, the QOS for
stakeholders should be chosen based on usage.  For example, if the "hepx" QOS
is already using all of its stakeholder CPUs, the submit plugin assigns any
additional hepx jobs to the "general" QOS so they run like non-stakeholder jobs.

To achieve this I used a command like the following:

  local cmd = "squeue --qos=" .. qos .. " --states=R --partition=" .. partition ..
              " --noheader --format='%C' | paste -sd+ | bc"
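
For reference, the summation part of that pipeline can be tried on its own, outside of slurmctld. The following is only a minimal sketch: `sum_cpus` is a hypothetical helper name, and it uses awk in place of `paste -sd+ | bc` so that a QOS with no running jobs yields an explicit 0. The squeue flags in the comment are the same ones used in the command above.

```shell
#!/bin/sh
# Hypothetical helper: sum one CPU count per line from stdin.
# Prints 0 when stdin is empty (for example, no running jobs in the QOS).
sum_cpus() {
    awk '{ total += $1 } END { print total + 0 }'
}

# Intended usage with the squeue flags from the plugin command:
#   squeue --qos=hepx --states=R --partition=serial --noheader --format='%C' | sum_cpus

# Demonstration with canned input instead of a live squeue call.
printf '2\n4\n16\n' | sum_cpus   # prints 22
printf '' | sum_cpus             # prints 0
```

Because the function only reads stdin, it can be checked without access to a running cluster.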

The output is captured using io.popen.  Unfortunately, any sbatch submission
that causes the command to be executed fails with the following:

# sbatch --uid testuser_hepx -n2 -p serial batches/job_submit_lua_test.slrm
sbatch: error: slurm_receive_msg: Socket timed out on send/recv operation
sbatch: error: Batch job submission failed: Socket timed out on send/recv operation

I tried setting "SchedulerParameters=batch_sched_delay=10,defer" with no luck.
Based on old mailing list threads, I also tried setting
"net.ipv4.tcp_max_syn_backlog" to 8192, but the problem persists.  The logs
show a pause of about 10 seconds while that shell command executes.  When I
run the same command from the command line, there is no delay.

My guess is that, while a job is being submitted, it is unwise or impossible
to run "squeue" from the same process that is handling the submission
(job_submit.lua).
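
If that guess is right, one possible workaround is to keep squeue out of the submit path entirely and have something else record the usage ahead of time. The following is only a minimal sketch under stated assumptions, not a tested solution: it would run from cron (or similar) and write the CPU total to a small file that the plugin could read. The names `write_usage` and `CACHE_DIR`, the `/tmp/qos_usage` default, and the `<qos>.<partition>` file naming are all hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, meant to run from cron rather than from job_submit.lua.
# CACHE_DIR and the file naming scheme are assumptions for illustration.
CACHE_DIR="${CACHE_DIR:-/tmp/qos_usage}"

# Sum the CPU counts read from stdin and store the total in
# $CACHE_DIR/<qos>.<partition>. The value is written to a temporary file
# and then renamed into place, so a reader never sees a half-written file.
write_usage() {
    qos="$1"
    partition="$2"
    mkdir -p "$CACHE_DIR" || return 1
    tmp="$(mktemp "$CACHE_DIR/.tmp.XXXXXX")" || return 1
    awk '{ total += $1 } END { print total + 0 }' > "$tmp" &&
        chmod 644 "$tmp" &&
        mv "$tmp" "$CACHE_DIR/$qos.$partition"
}

# Intended cron usage (same squeue flags as in the plugin command):
#   squeue --qos=hepx --states=R --partition=serial --noheader --format='%C' \
#       | write_usage hepx serial

# Demonstration with canned input instead of a live squeue call.
printf '8\n8\n4\n' | write_usage demo_qos demo_partition
cat "$CACHE_DIR/demo_qos.demo_partition"   # prints 20
```

The plugin could then read the file with io.open instead of calling io.popen, so no squeue request is made while the submission is being processed. The trade-off is that the number can be stale by up to one cron interval.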

Is there another way to achieve this functionality?  My current script is
available at https://gist.github.com/treydock/b964c5599fd057b0aa6a

Thanks,
- Trey

=============================

Trey Dockendorf 
Systems Analyst I 
Texas A&M University 
Academy for Advanced Telecommunications and Learning Technologies 
Phone: (979)458-2396 
Email: [email protected] 
Jabber: [email protected]
