Hi Fritz,

job_submit_lua.so gets built when Slurm is compiled, provided you have the 
lua-devel package installed at configure/make time.
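If you are unsure whether your build picked it up, a quick check along these lines will tell you (a sketch; the plugin directory path is an assumption, so check the PluginDir setting in your slurm.conf or the lib/slurm directory of your install):

```shell
# Returns success when the lua job_submit plugin exists in the given plugin dir.
has_lua_plugin() {
    [ -f "$1/job_submit_lua.so" ]
}

# Typical usage (the path below is an assumption, not a universal default):
has_lua_plugin /usr/lib64/slurm \
    && echo "lua plugin built" \
    || echo "rebuild Slurm with lua-devel installed"
```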

Sean
________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of 
Ratnasamy, Fritz <fritz.ratnas...@chicagobooth.edu>
Sent: Tuesday, 31 August 2021 15:05
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: [EXT] Re: [slurm-users] EXTERNAL-Re: [External] scancel gpu jobs when 
gpu is not requested

Hi Michael,

Thanks for your message. Does installing the job_submit_lua.so library require 
recompiling Slurm as well, i.e., do I have to compile Slurm with the 
job_submit_lua.so library to be able to add any plugin? I do not see it in the 
yum repo.
Thanks,

Fritz Ratnasamy
Data Scientist
Information Technology
The University of Chicago
Booth School of Business
5807 S. Woodlawn
Chicago, Illinois 60637
Phone: +(1) 773-834-4556


On Thu, Aug 26, 2021 at 9:18 AM Michael Robbert <mrobb...@mines.edu> wrote:

You need to set the following option in slurm.conf

JobSubmitPlugins

A comma delimited list of job submission plugins to be used. The specified 
plugins will be executed in the order listed. These are intended to be 
site-specific plugins which can be used to set default job parameters and/or 
logging events. Sample plugins available in the distribution include 
"all_partitions", "defaults", "logging", "lua", and "partition". For examples 
of use, see the Slurm code in "src/plugins/job_submit" and 
"contribs/lua/job_submit*.lua" then modify the code to satisfy your needs. 
Slurm can be configured to use multiple job_submit plugins if desired, however 
the lua plugin will only execute one lua script named "job_submit.lua" located 
in the default script directory (typically the subdirectory "etc" of the 
installation directory). No job submission plugins are used by default.
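Concretely, enabling just the lua plugin is a one-line setting (a minimal sketch of the relevant slurm.conf line; slurmctld must be reconfigured or restarted afterwards):

```
# slurm.conf -- load the lua job submit plugin
JobSubmitPlugins=lua
```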





Then as this documentation states, put the job_submit.lua into your script 
directory. Mine is in /etc/slurm/. You may want to make sure that you have the 
job_submit_lua.so library installed with your build of Slurm. I agree that 
finding complete documentation for this feature is a little difficult.
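For this thread's use case, a minimal job_submit.lua might look like the sketch below. Hedged heavily: the job_desc field names are version-dependent assumptions (older releases expose job_desc.gres, newer ones use job_desc.tres_per_node), and slurm.ERROR is used as a generic rejection code, so check the job_submit plugin documentation for your Slurm version before relying on it:

```lua
-- Sketch of a job_submit.lua that rejects jobs in the "gpu" partition
-- which do not request any GPUs. Field names are assumptions; verify
-- against your Slurm version's job_submit plugin documentation.
function slurm_job_submit(job_desc, part_list, submit_uid)
    if job_desc.partition == "gpu" then
        local gres = job_desc.gres  -- nil when no --gres option was given
        if gres == nil or string.find(gres, "gpu") == nil then
            slurm.log_user("gpu partition requires --gres=gpu:N")
            return slurm.ERROR
        end
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
```

Because this runs at submission time in slurmctld, the user gets the message immediately from sbatch/srun instead of waiting for a prolog to cancel the job.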



Mike



From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of 
Ratnasamy, Fritz <fritz.ratnas...@chicagobooth.edu>
Date: Wednesday, August 25, 2021 at 23:13
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] EXTERNAL-Re: [External] scancel gpu jobs when gpu is 
not requested

Hi Michael,

Thanks for your message. Yes, I was able to get all interactive sessions killed 
quickly when trying other partitions and deactivating the prolog. I read your 
example and understand how it could work (in the example, perhaps instead of 
checking whether a GPU model is passed, we could check the number of GPUs 
requested?), but where do I set up that function, and where do I call it?
Thanks,

Fritz Ratnasamy






On Wed, Aug 25, 2021 at 9:54 AM Michael Robbert <mrobb...@mines.edu> wrote:

I doubt that it is a problem with your script and suspect that there is some 
weird interaction with scancel on interactive jobs. If you wanted to get to the 
bottom of that, I'd suggest disabling the prolog and testing by manually 
cancelling some interactive jobs.

Another suggestion is to try a completely different approach to solve your 
problem. Why wait until the job starts to do the check? You can use a submit 
filter and it will alert the user as soon as they try to submit. That will 
prevent them from potentially having to wait in the queue if the cluster is 
busy and gets around having to cancel a running job. There is a description and 
simple example at the bottom of this page: 
https://slurm.schedmd.com/resource_limits.html



Mike



From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of 
Ratnasamy, Fritz <fritz.ratnas...@chicagobooth.edu>
Date: Tuesday, August 24, 2021 at 21:00
To: slurm-users@lists.schedmd.com
Subject: [External] [slurm-users] scancel gpu jobs when gpu is not requested




Hello,

I have written a script in my prolog.sh that cancels any Slurm job if the 
gres=gpu parameter is not present. This is the script I added to my prolog.sh:

if [ "$SLURM_JOB_PARTITION" == "gpu" ]; then
        if [ -n "${GPU_DEVICE_ORDINAL}" ]; then
                echo "GPU ID used is ID: $GPU_DEVICE_ORDINAL"
                # strip the commas from the ordinal list, e.g. "0,1" -> "01"
                list_gpu=$(echo "$GPU_DEVICE_ORDINAL" | sed -e "s/,//g")
                Ngpu=$(expr length "$list_gpu")
        else
                echo "No GPU selected"
                Ngpu=0
        fi

        # if 0 GPUs were allocated, cancel the job
        if [ "$Ngpu" -eq 0 ]; then
                scancel "${SLURM_JOB_ID}"
        fi
fi
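The counting step above can be isolated into a small function and exercised on its own (a sketch; it assumes GPU_DEVICE_ORDINAL is a comma-separated list of single-digit ordinals such as "0,1,2"):

```shell
# Count GPUs from a comma-separated ordinal list such as "0,1,2".
count_gpus() {
    local stripped="${1//,/}"   # drop the commas: "0,1,2" -> "012"
    echo "${#stripped}"         # remaining characters = number of GPUs
}

count_gpus "0,1,2"   # prints 3
count_gpus ""        # prints 0
```

Unlike `expr length`, the `${#var}` expansion also handles an empty list cleanly, which removes the need for a separate Ngpu=0 branch.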

What the code does is look at the number of GPUs allocated and, if it is 0, 
cancel the job. It works fine if a user runs sbatch submit.sh (and submit.sh 
does not contain --gres=gpu:1). However, when an interactive session is 
requested without GPUs, the job does get killed, but it hangs for 5-6 minutes 
before being killed.

jlo@mfe01:~ $ srun --partition=gpu --pty bash --login

srun: job 4631872 queued and waiting for resources

srun: job 4631872 has been allocated resources

srun: Force Terminated job 4631872 ... the kill hangs here for 5-6 minutes

Is there anything wrong with my script? Why do I see this hang only when an 
interactive session is cancelled with scancel? I would like to get rid of the 
hang.

Thanks

Fritz Ratnasamy



