Hello all,

I've been poring over the slurm-dev archive trying to find out how to modify 
the partition plugin to do my bidding, and I've come up empty. The answer might 
be there and I just missed it, but I can't dig any longer and am resigned to 
asking for help from someone more familiar with the problem than I am.

Let me start by describing my problem. We run a heterogeneous cluster with some 
InfiniBand-enabled nodes and some nodes without InfiniBand. We also have some 
researchers who have put money down and bought hardware to go in our cluster, 
and in return we provide priority access to those nodes. My ideal config, to 
get the best usage out of what is under the head node's purview, is to have all 
nodes in either a serial partition or an MPI partition. Both of those 
partitions have a low priority and are set up so that their jobs are preempted 
by requeuing, based on partition priority. The researcher nodes then go in a 
high-priority partition that is set to prevent preemption.

So my partitions look like this

Researcher owned MaxNodes=#of nodes owned
^
|
Serial MaxNodes=1
^
|
MPI MaxNodes=infinite
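
In slurm.conf terms, I picture it roughly like the fragment below. The node 
names, node counts and priority values are placeholders rather than my real 
config, and I have not verified this exact fragment.

```
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

# Placeholder node names, counts and priorities (not my real values)
PartitionName=mpi        Nodes=node[01-16] MaxNodes=UNLIMITED Priority=1  PreemptMode=REQUEUE
PartitionName=serial     Nodes=node[01-16] MaxNodes=1         Priority=2  PreemptMode=REQUEUE
PartitionName=researcher Nodes=node[13-16] MaxNodes=4         Priority=10 PreemptMode=OFF
```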

I would like the job_submit partition plugin to route jobs automatically based 
on the number of nodes the job requires. From what I can see in the code, out 
of the box it only checks that the memory per CPU requested by the user does 
not exceed the partition's max mem per CPU, and that the user is a member of 
the groups allowed to submit to that partition. I added a node validator block 
to it that attempts to compare the job's min_nodes against the partition's 
max_nodes setting.

static bool _valid_nodes(struct part_record *part_ptr,
                         struct job_descriptor *job_desc)
{
        uint32_t job_limit, part_limit;

        job_limit  = job_desc->min_nodes;
        part_limit = part_ptr->max_nodes;

        if (job_limit > part_limit) {
                debug("job_submit/partition: skipping partition %s due to "
                      "node limit (%u > %u)",
                      part_ptr->name, job_limit, part_limit);
                return false;
        }
        return true;
}

Then later on I added it to the iterator alongside the existing one.

if (!_valid_nodes(part_ptr, job_desc))
        continue;

It compiles without complaint. I put the .so in my slurm libs directory and 
set JobSubmitPlugins=partition in slurm.conf.

In testing I have found that my node validator only returns true when the 
partition node limit is set to infinite: everything goes to MPI until I define 
a max nodes value for that partition, and then the submission simply never 
returns SLURM_SUCCESS. I don't know what I need to set to get the debugging 
output to show up somewhere I can see which values it is trying to compare; 
using the debug flag on slurmctld just gets me the contents of the log printed 
to stdout.
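
One guess I have not been able to confirm: if the user does not ask for a node 
count, job_desc->min_nodes may still hold Slurm's "not set" sentinel (NO_VAL, 
which I believe is 0xfffffffe) when the plugin runs. As an unsigned value that 
is larger than any finite max_nodes but smaller than INFINITE (0xffffffff), 
which would match what I am seeing. Below is a standalone sketch of the 
comparison with that case handled. The SKETCH_ constants and the helper name 
node_limit_ok are my own stand-ins, not the real Slurm headers or API, and 
treating an unset request as one node is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the sentinel values I believe slurm.h defines.
 * Prefixed SKETCH_ so they are clearly not the real macros. */
#define SKETCH_NO_VAL   0xfffffffeU  /* value was not set by the user */
#define SKETCH_INFINITE 0xffffffffU  /* partition has no node limit */

/* Hypothetical helper: true if the job's node request fits the partition. */
static bool node_limit_ok(uint32_t job_min_nodes, uint32_t part_max_nodes)
{
        /* Assumption: an unset request is treated as a single node. */
        if (job_min_nodes == SKETCH_NO_VAL)
                job_min_nodes = 1;

        /* An unlimited partition accepts any request. */
        if (part_max_nodes == SKETCH_INFINITE)
                return true;

        return job_min_nodes <= part_max_nodes;
}
```

With this, an unset request would fit a MaxNodes=1 partition, while an 
explicit two-node request would still be skipped there.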

Anyone have anything to offer that might help me get this configured?
Thanks,
Buddy.

