So, I have this figured out. I felt pretty dumb when I traced back the debug 
function and found it was as easy as setting the debug level in slurm.conf.
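
For anyone chasing the same thing: the relevant knobs were the controller log 
level and log file in slurm.conf, something like the lines below (the log file 
path here is just an example; check the slurm.conf man page for your setup):

```
SlurmctldDebug=debug
SlurmctldLogFile=/var/log/slurm/slurmctld.log
```

With that set, the debug() output from the plugin shows up in the slurmctld log 
instead of only on stdout.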

Here are the sections I added to job_submit_partition.c to accomplish what I 
was after.

/* Compare the number of nodes requested by the job against the number
 * of nodes a user can request for this partition. */
static bool _valid_nodes(struct part_record *part_ptr,
                         struct job_descriptor *job_desc)
{
	/* Initialize so an entirely unset job_desc passes the check
	 * instead of comparing uninitialized values. */
	uint32_t job_limit = 0, part_limit = UINT32_MAX;

	/* If a value is undefined in job_desc it comes back as the max
	 * value for a 32-bit int minus 1 (0xfffffffe). */
	if (job_desc->num_tasks != UINT32_MAX - 1) {
		job_limit  = job_desc->num_tasks;
		part_limit = part_ptr->max_nodes;
	}
	if (job_desc->min_nodes != UINT32_MAX - 1) {
		job_limit  = job_desc->min_nodes;
		part_limit = part_ptr->max_nodes;
	}
	if (job_limit > part_limit) {
		debug("job_submit/partition: skipping partition %s due to "
		      "node limit (%u > %u)",
		      part_ptr->name, job_limit, part_limit);
		return false;
	}
	return true;
}
/* Compare the number of CPUs requested by the job against the total
 * number of CPUs in the partition. */
static bool _valid_cpu(struct part_record *part_ptr,
                       struct job_descriptor *job_desc)
{
	/* Initialize so an entirely unset job_desc passes the check
	 * instead of comparing uninitialized values. */
	uint32_t job_limit = 0, part_limit = UINT32_MAX;

	/* If a value is undefined in job_desc it comes back as the max
	 * value for a 32-bit int minus 1 (0xfffffffe). */
	if (job_desc->min_cpus != UINT32_MAX - 1) {
		job_limit  = job_desc->min_cpus;
		part_limit = part_ptr->total_cpus;
	}
	if (job_limit > part_limit) {
		debug("job_submit/partition: skipping partition %s due to "
		      "cpu limit (%u > %u)",
		      part_ptr->name, job_limit, part_limit);
		return false;
	}
	return true;
}


/* These checks go into the loop where the plugin tests the other job
 * elements against each partition. */

		if (!_valid_nodes(part_ptr, job_desc))
			continue;
		if (!_valid_cpu(part_ptr, job_desc))
			continue;

What this adds is a check on how many nodes the job requests (if that value is 
defined), and on the number of tasks (if that is defined), comparing those 
values against the partition limits. Then it does a CPU check to make sure 
there are actually enough CPUs in the partition to run the job.


Hope this helps someone doing something similar to what we are.

Buddy.


From: Scharfenberg, Buddy Lee
Sent: Thursday, April 23, 2015 9:10 AM
To: '[email protected]'
Subject: Job Submit plugin help

Hello all,

I've been poring through the slurm-dev archive trying to find how to modify 
the partition plugin to do my bidding, and I've come up empty. It might be 
there and I just missed it, but I can't dig any longer and am resigned to 
asking for help from someone more familiar with the problem than I.

Let me start by describing my problem. We run a heterogeneous cluster with 
some InfiniBand-enabled nodes and some non-InfiniBand nodes. We also have some 
researchers who have put money down and bought hardware for our cluster, and 
in return we provide priority access to those nodes. My ideal config, to get 
the best usage out of what is under the head node's purview, is to have all 
nodes in either a serial partition or an MPI partition, with those partitions 
set to preempt by requeuing jobs based on partition priority (their priority 
set low), and to put the researcher nodes in a high-priority partition that is 
set to prevent preemption.

So my partitions look like this (lowest priority at the bottom):

Researcher owned MaxNodes=#of nodes owned
^
|
Serial MaxNodes=1
^
|
MPI MaxNodes=infinite
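
In slurm.conf terms, that layout would be something like the fragment below. 
The node names, priority values, and node counts here are illustrative, not 
our actual config:

```
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

PartitionName=mpi    Nodes=node[01-32] MaxNodes=UNLIMITED Priority=1  PreemptMode=REQUEUE
PartitionName=serial Nodes=node[01-32] MaxNodes=1         Priority=2  PreemptMode=REQUEUE
PartitionName=owned  Nodes=node[33-40] MaxNodes=8         Priority=10 PreemptMode=OFF
```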

I would like the job_submit partition plugin to route jobs automatically based 
on the number of nodes required for the job. From what I can see in the code, 
out of the box it only checks that the max mem per CPU for the partition is 
not exceeded by what the user requested, and that the user is a member of the 
groups allowed to submit to that partition. I added a node validator block to 
compare the job's min_nodes to the partition's max_nodes setting.

static bool _valid_nodes(struct part_record *part_ptr,
                         struct job_descriptor *job_desc)
{
	uint32_t job_limit, part_limit;

	job_limit  = job_desc->min_nodes;
	part_limit = part_ptr->max_nodes;

	if (job_limit > part_limit) {
		debug("job_submit/partition: skipping partition %s due to "
		      "node limit (%u > %u)",
		      part_ptr->name, job_limit, part_limit);
		return false;
	}
	return true;
}

Then later on I added it to the iterator alongside the existing one.

	if (!_valid_nodes(part_ptr, job_desc))
		continue;

It compiles and doesn't complain; I put the .so in my Slurm libs directory and 
set JobSubmitPlugins=partition in slurm.conf.

In testing I have found that my node validator only returns true when I have 
the partition node limit set to infinite: everything goes into MPI until I 
define the max nodes for that partition, and then it simply never returns 
SLURM_SUCCESS. I don't know what I need to set to get the debugging output to 
show up somewhere I can look at the values it is trying to compare; using the 
debug flag on slurmctld just gets me the log printed to stdout.

Anyone have anything to offer that might help me get this configured?
Thanks,
Buddy.

