Re: [gridengine users] FW: Requesting a resource OR another resource

William Hay Thu, 20 Nov 2014 05:32:18 -0800

On Thu, 20 Nov 2014 12:37:54 +0000
Kevin Taylor <[email protected]> wrote:


> I forgot to copy this back to the list.
> 
> So far, this concept might be looking very good for what we need. I'm able to 
> get files created, and modify the environment file for the job 
> accordingly...the next question is how do I handle the case of two jobs 
> launching simultaneously and writing to the same lockfile?
> 
> Thanks for the help everyone.

You use the standard Unix trick for creating lockfiles.  Create the file with a 
temporary name, then attempt to hard link it to the official lockfile name.  
If the link succeeds you have the lock; if it fails you don't.  On a regular 
local filesystem the return code of the link syscall tells you whether you 
succeeded.  If you are doing the link on NFS or a similar filesystem, you have 
to explicitly check whether the lockfile and the temporary file are the same 
file after the link call, because the call can report failure even though the 
link was actually created (a false negative).  Remember to clean up temporary 
files once you've grabbed the lock, and to release (delete) any lockfiles your 
job owns (and only those) in the epilog.  Incidentally, you probably want to 
add $SGE_TASK_ID to the identifying info in the lockfile in case someone 
submits an array job requesting GPUs.
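
A minimal Perl sketch of the link trick described above, in case it helps. 
The sub names (acquire_lock, release_lock, same_file) and the "owner" string 
are made up for illustration and are not part of Grid Engine; the sketch 
assumes the temporary file and the lockfile live in the same directory, since 
hard links cannot cross filesystems.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use File::Basename qw(dirname);

# True if both paths exist and refer to the same file (same device and inode).
sub same_file {
    my ($x, $y) = @_;
    my @sx = stat($x) or return 0;
    my @sy = stat($y) or return 0;
    return ($sx[0] == $sy[0] && $sx[1] == $sy[1]) ? 1 : 0;
}

# Try to take $lockfile on behalf of $owner (hypothetical identifier,
# e.g. "$JOB_ID.$SGE_TASK_ID").  Returns 1 if we hold the lock, 0 if not.
sub acquire_lock {
    my ($lockfile, $owner) = @_;

    # Create a uniquely named temporary file next to the lockfile.
    my ($fh, $tmpname) = tempfile("lock-XXXXXX", DIR => dirname($lockfile));
    print {$fh} "$owner\n";
    close($fh) or die "close $tmpname: $!";

    # link() fails if $lockfile already exists, so only one job can win.
    my $linked = link($tmpname, $lockfile);

    # Do not rely on the return value alone (it may be wrong on NFS):
    # check whether the lockfile and our temporary file are the same file.
    my $got_lock = ($linked || same_file($tmpname, $lockfile)) ? 1 : 0;

    # The temporary name is no longer needed either way.
    unlink($tmpname);
    return $got_lock;
}

# Delete $lockfile only if it records $owner; intended for the epilog.
# Returns 1 if the lock was released, 0 otherwise.
sub release_lock {
    my ($lockfile, $owner) = @_;
    open(my $fh, '<', $lockfile) or return 0;
    my $recorded = <$fh>;
    close($fh);
    $recorded = '' unless defined $recorded;
    chomp $recorded;
    return 0 unless $recorded eq $owner;
    return unlink($lockfile) ? 1 : 0;
}
```

In a prolog, one would loop over the candidate lockfile names and stop at the 
first one for which acquire_lock returns 1, then call release_lock with the 
same owner string from the epilog.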

William   


> 
> ________________________________
> From: [email protected]
> To: [email protected]
> Subject: RE: [gridengine users] Requesting a resource OR another resource
> Date: Thu, 20 Nov 2014 07:15:32 -0500
> 
> How do you handle the issue of two jobs starting simultaneously? Here's what 
> I've done so far (our jobs for the moment are opengl, so we just need to know 
> the screen number :0.0 and :0.1). I find it easier to write in Perl.
> 
> #!/usr/bin/perl
> 
> $hostname = `hostname`;
> chomp $hostname;
> 
> $jobnumber = $ENV{'JOB_ID'};
> $lockprefix = "/var/tmp/$hostname-gpuinfo";
> 
> # Determine if a GPU was requested via the job ID info
> $stats = `$ENV{'SGE_BINARY_PATH'}/qstat -j $jobnumber | grep gpu_free`;
> 
> # We asked for a GPU
> if ($stats) {
>    # Total number of GPUS
>    # nvidia-smi -L will list the GPUs for a total
>    $totalgpus = `/usr/bin/nvidia-smi -L | wc -l`;
> 
>    for ($i = 1; $i <= $totalgpus; $i++) {
>       # If a lockfile already exists
>       $lockfilename = "$lockprefix"."-GPU$i";
>       if ( -e "$lockfilename" ) {
>         next;
>       }
>       else {
>         system("echo $jobnumber >> $lockfilename");
>         $displayname=$i-1;
>         system("echo \"GPUNAME=:0.$displayname\" >> "
>                . "$ENV{'SGE_JOB_SPOOL_DIR'}/environment");
>         system("chown $ENV{'SGE_O_LOGNAME'} $lockfilename");
>         exit 0;
>       }
>    }
> 
> }
> # If we didn't ask for a GPU, just exit gracefully.
> else {
>   exit 0;
> }
> 
> With this script I do run into a condition where if two jobs start at once, 
> the lockfiles aren't there, so they both create the first lockfile.
> 
> 
> 
> > Date: Thu, 20 Nov 2014 08:43:28 +0000
> > From: [email protected]
> > To: [email protected]
> > Subject: Re: [gridengine users] Requesting a resource OR another resource
> >
> > On Wed, 19 Nov 2014 16:47:56 +0000
> > Kevin Taylor <[email protected]> wrote:
> >
> > > So, the lock file you create is just there to identify the assigned GPU? 
> > > I haven't done anything with prolog stuff before, but I'll take a look.
> >
> > Sort of. Unix doesn't provide an atomic test-and-set for chgrp, so we 
> > use lock files to prevent races when two jobs are starting simultaneously.
> > The lock file also contains info that identifies the job uniquely while the 
> > per-job groups get reused. We can use this to detect if anything goes wrong
> > with the normal cleanup when a job terminates.
> >
> >
> > --
> > William Hay <[email protected]>


-- 
William Hay <[email protected]>


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
