Hi Sergio,

What you describe would require major, complex changes and several
months of work. Changes would be required to the data structures in
src/common/gres.h and the code in src/common/gres.c. Major changes
would also be required to the select plugin of your choice, basically
adding yet another dimension to its resource-selection logic. If you
decide to proceed, I would suggest spending several weeks studying the
code and developing a design, then posting the design to this mailing
list for comment.
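
To make the "extra dimension" concrete, here is a minimal sketch in C
of selecting GPUs against a cluster-wide count instead of a per-node
count. It is not Slurm code: struct node_gpus and
pick_gpus_cluster_wide() are hypothetical names invented for
illustration, and a real select plugin would have to weigh this
together with CPUs, memory and everything else it already optimizes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-node view of free GPUs; NOT a Slurm data structure. */
struct node_gpus {
    const char *name;   /* node name, for reporting only */
    int gpus_free;      /* GPUs currently free on this node */
};

/*
 * Greedy cluster-wide selection: walk the nodes in order and take as
 * many free GPUs from each one as are still needed.
 *
 * take[i] receives the number of GPUs taken from nodes[i].
 * Returns 0 on success, or -1 if the request is negative or the
 * cluster as a whole has too few free GPUs (take[] is then zeroed).
 */
int pick_gpus_cluster_wide(const struct node_gpus *nodes, size_t n_nodes,
                           int gpus_wanted, int *take)
{
    int remaining = gpus_wanted;
    size_t i;

    assert(nodes != NULL && take != NULL);

    for (i = 0; i < n_nodes; i++)
        take[i] = 0;
    if (gpus_wanted < 0)
        return -1;

    for (i = 0; i < n_nodes && remaining > 0; i++) {
        /* Treat a negative free count as "nothing available". */
        int avail = nodes[i].gpus_free > 0 ? nodes[i].gpus_free : 0;

        take[i] = (remaining < avail) ? remaining : avail;
        remaining -= take[i];
    }

    if (remaining > 0) {
        /* Not satisfiable: undo the partial selection. */
        for (i = 0; i < n_nodes; i++)
            take[i] = 0;
        return -1;
    }
    return 0;
}
```

With nodes holding 2, 1 and 1 free GPUs, a request for 3 would take 2
from the first node and 1 from the second. The hard part in Slurm is
not this loop but fitting the same idea into the existing gres
bookkeeping and the select plugin's node-selection logic.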

Quoting Sergio Iserte Agut <sise...@uji.es>:

> I have a cluster with 3 nodes (one of them has 2 GPUs, while the others
> have 1 each).
>
> I know that if I run:
>     # srun --gres=gpu:3 hostname
> I get the error:
>     srun: error: Unable to allocate resources: Requested node
>     configuration is not available
> because Slurm is not able to allocate gres across nodes.
> My goal is for Slurm to be able to do this, using a global GPU counter.
> I have a tool which distributes the work among the GPUs of the cluster,
> which is why I would like Slurm to schedule and select these GPUs.
>
> I saw some clues in /var/log/slurmctld.log:
>     _pick_best_nodes: job 110 never runnable
>     _slurm_rpc_allocate_resources: Requested node configuration is not
>     available
>
> I have spent several days trying to understand where the error is
> generated, so that I can start on my implementation.
> I have discovered this flow among the modules:
>     node_scheduler.c -> node_select.c -> select_plugin.c -> gres.c
> However, I don't know where to start, because I would rather not modify
> the Slurm core; I would prefer to do it with plugins.
>
> I hope this is well explained.
>
> Regards,
>     Sergio Iserte.
>
