Re: [gridengine users] Multi-GPU setup

2019-08-28 Thread Hay, William
On Wed, Aug 14, 2019 at 05:11:02PM +0200, Nicolas FOURNIALS wrote:
> Hi,
> 
> On 14/08/2019 at 16:35, Andreas Haupt wrote:
> > Preventing access to the 'wrong' gpu devices by "malicious jobs" is not
> > that easy. An idea could be to e.g. play with device permissions.
> 
> That's what we do by having /dev/nvidia[0-n] files owned by root and with
> permissions 660.
> Prolog (executed as root) changes the file owner to give it to the user
> running the job. Epilog gives the file back to root.
We do something similar but change the group of the device to match the one
assigned to the job.  This allows for multiple jobs from the same user
without interference.  You have to set a magic kernel option
to prevent the permissions on the device files from auto-changing.
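
A minimal sketch of this approach (not William's actual scripts). The module
parameter below is my guess at the "magic kernel option" (it stops the NVIDIA
driver from re-creating the device files with its own permissions), and
$job_group / $gpu_id are placeholders for however your prolog learns the
job's group and the assigned device:

    # /etc/modprobe.d/nvidia.conf  (assumption, see above)
    options nvidia NVreg_ModifyDeviceFiles=0

    # prolog fragment, run as root
    chgrp "$job_group" "/dev/nvidia${gpu_id}"
    chmod 0660 "/dev/nvidia${gpu_id}"

    # epilog fragment, run as root: hand the device back
    chgrp root "/dev/nvidia${gpu_id}"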


William




Re: [gridengine users] Multi-GPU setup

2019-08-20 Thread Dj Merrill
Apologies, I should have followed up on this.  It looks like they've
already started work on handling the NVidia device permissions.  Look
under the branches section, and there are useful notes in both the
"hardened" and "nvidia_dev_chgrp" branches.

https://github.com/RSE-Sheffield/sge-gpuprolog/branches

I haven't yet had a chance to do much with this.

-Dj


Re: [gridengine users] Multi-GPU setup

2019-08-20 Thread Nicolas FOURNIALS



On 14/08/2019 at 19:50, Dj Merrill wrote:

Thanks everyone for the feedback.  I found this on Github that looks
promising:

https://github.com/RSE-Sheffield/sge-gpuprolog


Thanks for pointing it out.



I can probably edit the scripts to also change the permissions on the
/dev/nvidia* devices as some of you have suggested, which would make sense.

If anyone is willing to share their working prolog/epilog scripts that
change the /dev/nvidia* permissions, I would greatly appreciate it.


I added an issue to the above project to write up some ideas about how to 
add this functionality. We use UGE here, which has a different way of 
managing complex resources, but I suppose that once you have built an index 
of the assigned GPU(s), enforcing the device file permissions is as simple 
as described in the issue:
https://github.com/RSE-Sheffield/sge-gpuprolog/issues/24


--
Nicolas Fournials
System administrator
CC-IN2P3/CNRS


Re: [gridengine users] Multi-GPU setup

2019-08-14 Thread Dj Merrill
Thanks everyone for the feedback.  I found this on Github that looks
promising:

https://github.com/RSE-Sheffield/sge-gpuprolog

and this to go with it:

https://gist.github.com/willfurnass/10277756070c4f374e6149a281324841

I can probably edit the scripts to also change the permissions on the
/dev/nvidia* devices as some of you have suggested, which would make sense.

If anyone is willing to share their working prolog/epilog scripts that
change the /dev/nvidia* permissions, I would greatly appreciate it.

Thanks,

-Dj




Re: [gridengine users] Multi-GPU setup

2019-08-14 Thread Joshua Baker-LePain

On Wed, 14 Aug 2019 at 7:21am, Dj Merrill wrote


To date in our HPC Grid running Son of Grid Engine 8.1.9, we've had
single Nvidia GPU cards per compute node.  We are contemplating the
purchase of a single compute node that has multiple GPU cards in it, and
want to ensure that running jobs only have access to the GPU resources
they ask for, and don't take over all of the GPU cards in the system.


We use epilog and prolog scripts based on 
 to assign GPUs to jobs.  It's 
(obviously) up to the users' scripts to honor the assignments, but it's 
been working for us so far.



We define gpu as a resource:
qconf -sc:
#name  shortcut  type  relop  requestable  consumable  default  urgency
gpu    gpu       INT   <=     YES          YES         0        0


We *used* to run this way until we ran into what seems like a bug in SoGE 
8.1.9.  See  
and the ensuing thread for details, but the summary is that SGE would 
insist on trying to run a job on a particular node, even if there were 
free GPUs elsewhere.  It was happening so often that we had to change our 
approach, and defined a queue on each GPU node with the same 
number of slots as GPUs.  It's a far from perfect system, but it's working 
for now.
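
For reference, a rough sketch of what such a per-GPU-node queue might look
like (the queue and host names here are invented, not Joshua's configuration):

    # qconf -sq gpu.q   -- one queue (instance) per GPU node,
    #                      slots set to the number of GPUs in that node
    qname        gpu.q
    hostlist     gpunode01
    slots        4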


--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF


Re: [gridengine users] Multi-GPU setup

2019-08-14 Thread bergman
In the message dated: Wed, 14 Aug 2019 10:21:12 -0400,
The pithy ruminations from Dj Merrill on 
[[gridengine users] Multi-GPU setup] were:
=> To date in our HPC Grid running Son of Grid Engine 8.1.9, we've had
=> single Nvidia GPU cards per compute node.  We are contemplating the
=> purchase of a single compute node that has multiple GPU cards in it, and
=> want to ensure that running jobs only have access to the GPU resources
=> they ask for, and don't take over all of the GPU cards in the system.

That's an issue.

=> 
=> We define gpu as a resource:
=> qconf -sc:
=> #name  shortcut  type  relop  requestable  consumable  default  urgency
=> gpu    gpu       INT   <=     YES          YES         0        0
=> 
=> We define GPU persistence mode and exclusive process on each node:
=> nvidia-smi -pm 1
=> nvidia-smi -c 3

Good.

=> 
=> We set the number of GPUs in the host definition:
=> qconf -me (hostname)
=> 
=> complex_values   gpu=1   for our existing nodes, and this setup has been
=> working fine for us.

Good.

=> 
=> With the new system, we would set:
=> complex_values   gpu=4

Yes.

=> 
=> 
=> If a job is submitted asking for one GPU, will it be limited to only
=> having access to a single GPU card on the system, or can it detect the
=> other cards and take up all four (and if so how do we prevent that)?

There are two issues you'll need to deal with:

1. Preventing a job from using more than the requested number of GPUs
   I don't have a great answer for that. As you see, SGE is good at keeping
   track of the number of instances of a resource (the count), but not which
   physical GPU is assigned to a job.

For a cgroups-like solution, see:
http://gridengine.org/pipermail/users/2014-November/008128.html
http://gridengine.org/pipermail/users/2017-October/009952.html
http://gridengine.org/pipermail/users/2017-February/009581.html


I don't have experience with the described method, but the trick (using
a job prolog to chgrp the /dev/nvidia${GPUNUM} device) is on my list
of things-to-do.

2. Ensuring that a job tries to use a free GPU, not just _any_ GPU
   Since SGE doesn't explicitly tell the job which GPU to use, we've found
   that a lot of software blindly tries to use GPU #0, apparently assuming
   that it is running on a single-user/single-GPU system (python, I'm
   looking at you). Our solution has been to "suggest" that users run a
   command in their submit script to report the number (GPU ID) of the next
   free GPU. This has eliminated most instances of this issue, but there
   are still some race conditions.

#! /bin/bash
#
# Script to return the GPU ID of an idle GPU, if any
#
# Used in a submit script, in the form
#
#   CUDA_VISIBLE_DEVICES=`get_CUDA_VISIBLE_DEVICES` || exit
#   export CUDA_VISIBLE_DEVICES
#   myGPUjob
#
# Some software takes the specification of the GPU device on the command
# line. In that case, the command line might be something like:
#
#   myGPUjob options -dev cuda${CUDA_VISIBLE_DEVICES}
#

# The command:
#   nvidia-smi pmon
# returns output in the form:
#
#   # gpu     pid  type   sm   mem   enc   dec   command
#   # Idx       #  C/G     %     %     %     %   name
#       0       -  -       -     -     -     -   -
#
# Note the absence (-) of a PID to indicate an idle GPU

which nvidia-smi 1> /dev/null 2>&1
if [ $? != 0 ] ; then
    # no nvidia-smi found!
    echo "-1"
    echo "No 'nvidia-smi' utility found on node `hostname -s` at `date`." 1>&2
    if [ "X$JOB_ID" != "X" ] ; then
        # running as a batch job, this shouldn't happen
        printf "SGE job ${JOB_ID}: No 'nvidia-smi' utility found on node `hostname -s` at `date`.\n" | \
            Mail -s "unexpected: no nvidia-smi utility on `hostname -s`" root
    fi
    exit 1
fi

numGPUs=`nvidia-smi pmon -c 1 | wc -l` ; numGPUs=$((numGPUs - 2))   # subtract the headers
free=`nvidia-smi pmon -c 1 | awk '{if ( $2 == "-" ) {print $1 ; exit}}'`

if [[ "X$free" != "X" && $numGPUs -gt 1 ]] ; then
    # We may have a race condition, where 2 (or more) GPU jobs are probing
    # nvidia-smi at once and each reports that the same GPU is free. Sleep a
    # random amount of time and check again -- this is not guaranteed to
    # avoid the conflict, but it will help...
    sleep $((RANDOM % 11))
    free=`nvidia-smi pmon -c 1 | awk '{if ( $2 == "-" ) {print $1 ; exit}}'`
fi

if [ "X$free" = "X" ] ; then
    echo "-1"
    echo "SGE job ${JOB_ID} (${JOB_NAME}) failed: no free GPU on node `hostname -s` at `date`." 1>&2
    ( printf "SGE job ${JOB_ID}, job name ${JOB_NAME} from $USER\nNo free GPU on node 

Re: [gridengine users] Multi-GPU setup

2019-08-14 Thread Ian Kaufman
You could probably do this using consumables and resource quotas to
enforce them.
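
A rough sketch of such a resource quota set (the names and the per-user,
per-host limit below are invented for illustration, not taken from this
thread):

    # qconf -srqs gpu_quota
    {
       name         gpu_quota
       description  limit the gpu consumable per user on each host
       enabled      TRUE
       limit        users {*} hosts {*} to gpu=4
    }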

Ian

On Wed, Aug 14, 2019 at 8:34 AM Christopher Heiny wrote:

> On Wed, 2019-08-14 at 16:35 +0200, Andreas Haupt wrote:
> > Hi Dj,
> >
> > we do this by setting $CUDA_VISIBLE_DEVICES in a prolog script (and
> > according to what has been requested by the job).
> >
> > Preventing access to the 'wrong' gpu devices by "malicious jobs" is
> > not
> > that easy. An idea could be to e.g. play with device permissions.
>
>
> We use the same approach on our SGE 8.1.9 cluster, with consumables for
> number of GPUs needed and GPU RAM required, and other requestable
> attributes for GPU model, Cuda level and so on.
>
> Fortunately, the user base is small and very cooperative, so at this
> time I'm not worried about malicious users.
>
> Cheers,
> Chris
>
> >
> > Cheers,
> > Andreas
> >
> > On Wed, 2019-08-14 at 10:21 -0400, Dj Merrill wrote:
> > > To date in our HPC Grid running Son of Grid Engine 8.1.9, we've had
> > > single Nvidia GPU cards per compute node.  We are contemplating the
> > > purchase of a single compute node that has multiple GPU cards in
> > > it, and
> > > want to ensure that running jobs only have access to the GPU
> > > resources
> > > they ask for, and don't take over all of the GPU cards in the
> > > system.
> > >
> > > We define gpu as a resource:
> > > qconf -sc:
> > > #name  shortcut  type  relop  requestable  consumable  default  urgency
> > > gpu    gpu       INT   <=     YES          YES         0        0
> > >
> > > We define GPU persistence mode and exclusive process on each node:
> > > nvidia-smi -pm 1
> > > nvidia-smi -c 3
> > >
> > > We set the number of GPUs in the host definition:
> > > qconf -me (hostname)
> > >
> > > complex_values   gpu=1   for our existing nodes, and this setup has
> > > been
> > > working fine for us.
> > >
> > > With the new system, we would set:
> > > complex_values   gpu=4
> > >
> > >
> > > If a job is submitted asking for one GPU, will it be limited to
> > > only
> > > having access to a single GPU card on the system, or can it detect
> > > the
> > > other cards and take up all four (and if so how do we prevent
> > > that)?
> > >
> > > Is there something like "cgroups" for gpus?
> > >
> > > Thanks,
> > >
> > > -Dj
> > >
> > >
> > > ___
> > > users mailing list
> > > users@gridengine.org
> > > https://gridengine.org/mailman/listinfo/users
> >
> > --
> > > Andreas Haupt    | E-Mail: andreas.ha...@desy.de
> > >  DESY Zeuthen    | WWW:    http://www-zeuthen.desy.de/~ahaupt
> > >  Platanenallee 6 | Phone:  +49/33762/7-7359
> > >  D-15738 Zeuthen | Fax:    +49/33762/7-7216
> >
> > ___
> > users mailing list
> > users@gridengine.org
> > https://gridengine.org/mailman/listinfo/users
>
>
>
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
>


-- 
Ian Kaufman
Research Systems Administrator
UC San Diego, Jacobs School of Engineering ikaufman AT ucsd DOT edu


Re: [gridengine users] Multi-GPU setup

2019-08-14 Thread Christopher Heiny
On Wed, 2019-08-14 at 16:35 +0200, Andreas Haupt wrote:
> Hi Dj,
> 
> we do this by setting $CUDA_VISIBLE_DEVICES in a prolog script (and
> according to what has been requested by the job).
> 
> Preventing access to the 'wrong' gpu devices by "malicious jobs" is
> not
> that easy. An idea could be to e.g. play with device permissions.


We use the same approach on our SGE 8.1.9 cluster, with consumables for
number of GPUs needed and GPU RAM required, and other requestable
attributes for GPU model, Cuda level and so on.
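
As an illustration only (the complex names and shortcuts below are invented,
not Chris's actual configuration), such a set of complexes might look roughly
like this in qconf -sc output:

    #name        shortcut  type      relop  requestable  consumable  default  urgency
    gpu          gpu       INT       <=     YES          YES         0        0
    gpu_ram      gram      MEMORY    <=     YES          YES         0        0
    gpu_model    gmodel    RESTRING  ==     YES          NO          NONE     0
    cuda_level   cuda      DOUBLE    <=     YES          NO          0        0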

Fortunately, the user base is small and very cooperative, so at this
time I'm not worried about malicious users.

Cheers,
Chris

> 
> Cheers,
> Andreas
> 
> On Wed, 2019-08-14 at 10:21 -0400, Dj Merrill wrote:
> > To date in our HPC Grid running Son of Grid Engine 8.1.9, we've had
> > single Nvidia GPU cards per compute node.  We are contemplating the
> > purchase of a single compute node that has multiple GPU cards in
> > it, and
> > want to ensure that running jobs only have access to the GPU
> > resources
> > they ask for, and don't take over all of the GPU cards in the
> > system.
> > 
> > We define gpu as a resource:
> > qconf -sc:
> > #name  shortcut  type  relop  requestable  consumable  default  urgency
> > gpu    gpu       INT   <=     YES          YES         0        0
> > 
> > We define GPU persistence mode and exclusive process on each node:
> > nvidia-smi -pm 1
> > nvidia-smi -c 3
> > 
> > We set the number of GPUs in the host definition:
> > qconf -me (hostname)
> > 
> > complex_values   gpu=1   for our existing nodes, and this setup has
> > been
> > working fine for us.
> > 
> > With the new system, we would set:
> > complex_values   gpu=4
> > 
> > 
> > If a job is submitted asking for one GPU, will it be limited to
> > only
> > having access to a single GPU card on the system, or can it detect
> > the
> > other cards and take up all four (and if so how do we prevent
> > that)?
> > 
> > Is there something like "cgroups" for gpus?
> > 
> > Thanks,
> > 
> > -Dj
> > 
> > 
> > ___
> > users mailing list
> > users@gridengine.org
> > https://gridengine.org/mailman/listinfo/users
> 
> --
> > Andreas Haupt    | E-Mail: andreas.ha...@desy.de
> >  DESY Zeuthen    | WWW:    http://www-zeuthen.desy.de/~ahaupt
> >  Platanenallee 6 | Phone:  +49/33762/7-7359
> >  D-15738 Zeuthen | Fax:    +49/33762/7-7216
> 
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users





Re: [gridengine users] Multi-GPU setup

2019-08-14 Thread Friedrich Ferstl
Yes, UGE supports this out of the box. Depending on whether the job is a 
regular job or a Docker container, the method used to restrict access to only 
the assigned GPU differs slightly. UGE will also only schedule jobs to nodes 
where it is guaranteed to be able to do this.

The interface for configuring this is a set of fairly versatile extensions to 
RSMAPs, as pointed to by Skylar.

Cheers,

Fritz

> On 14.08.2019 at 17:16, Tina Friedrich wrote:
> 
> Hello,
> 
> from a kernel/mechanism point of view, it is perfectly possible to 
> restrict device access using cgroups. I use that on my current cluster, 
> works really well (both for things like CPU cores and GPUs - you only 
> see what you request, even using something like 'nvidia-smi').
> 
> Sadly, my current cluster isn't Grid Engine based :( and I have no idea 
> if SoGE or UGE support doing so out of the box - I've never had to do 
> that whilst still working with Grid Engine. Wouldn't be surprised if UGE 
> can do it.
> 
> You could probably script something yourself - I know I made a custom 
> suspend method once that used cgroups for non-MPI jobs.
> 
> Tina
> 
> On 14/08/2019 15:35, Andreas Haupt wrote:
>> Hi Dj,
>> 
>> we do this by setting $CUDA_VISIBLE_DEVICES in a prolog script (and
>> according to what has been requested by the job).
>> 
>> Preventing access to the 'wrong' gpu devices by "malicious jobs" is not
>> that easy. An idea could be to e.g. play with device permissions.
>> 
>> Cheers,
>> Andreas
>> 
>> On Wed, 2019-08-14 at 10:21 -0400, Dj Merrill wrote:
>>> To date in our HPC Grid running Son of Grid Engine 8.1.9, we've had
>>> single Nvidia GPU cards per compute node.  We are contemplating the
>>> purchase of a single compute node that has multiple GPU cards in it, and
>>> want to ensure that running jobs only have access to the GPU resources
>>> they ask for, and don't take over all of the GPU cards in the system.
>>> 
>>> We define gpu as a resource:
>>> qconf -sc:
>>> #name  shortcut  type  relop  requestable  consumable  default  urgency
>>> gpu    gpu       INT   <=     YES          YES         0        0
>>> 
>>> We define GPU persistence mode and exclusive process on each node:
>>> nvidia-smi -pm 1
>>> nvidia-smi -c 3
>>> 
>>> We set the number of GPUs in the host definition:
>>> qconf -me (hostname)
>>> 
>>> complex_values   gpu=1   for our existing nodes, and this setup has been
>>> working fine for us.
>>> 
>>> With the new system, we would set:
>>> complex_values   gpu=4
>>> 
>>> 
>>> If a job is submitted asking for one GPU, will it be limited to only
>>> having access to a single GPU card on the system, or can it detect the
>>> other cards and take up all four (and if so how do we prevent that)?
>>> 
>>> Is there something like "cgroups" for gpus?
>>> 
>>> Thanks,
>>> 
>>> -Dj
>>> 
>>> 
>>> ___
>>> users mailing list
>>> users@gridengine.org
>>> https://gridengine.org/mailman/listinfo/users
>>> 
>>> ___
>>> users mailing list
>>> users@gridengine.org
>>> https://gridengine.org/mailman/listinfo/users
> 
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users




Re: [gridengine users] Multi-GPU setup

2019-08-14 Thread Tina Friedrich
Hello,

from a kernel/mechanism point of view, it is perfectly possible to 
restrict device access using cgroups. I use that on my current cluster, 
works really well (both for things like CPU cores and GPUs - you only 
see what you request, even using something like 'nvidia-smi').
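
To illustrate the mechanism Tina describes (a hand-rolled sketch of the
cgroup v1 "devices" controller, not her cluster's actual setup; the cgroup
path is invented, and NVIDIA GPUs are character devices with major number
195, minors 0..N, plus /dev/nvidiactl at 195:255):

    cg=/sys/fs/cgroup/devices/gpujobs/job_${JOB_ID}   # illustrative path
    mkdir -p "$cg"
    echo 'c 195:* rwm' > "$cg/devices.deny"     # deny all /dev/nvidia* devices
    echo 'c 195:0 rw'  > "$cg/devices.allow"    # re-allow only /dev/nvidia0
    echo 'c 195:255 rw' > "$cg/devices.allow"   # keep /dev/nvidiactl usable
    echo $$ > "$cg/tasks"                       # move this (job) process in; children inherit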

Sadly, my current cluster isn't Grid Engine based :( and I have no idea 
if SoGE or UGE support doing so out of the box - I've never had to do 
that whilst still working with Grid Engine. Wouldn't be surprised if UGE 
can do it.

You could probably script something yourself - I know I made a custom 
suspend method once that used cgroups for non-MPI jobs.

Tina

On 14/08/2019 15:35, Andreas Haupt wrote:
> Hi Dj,
> 
> we do this by setting $CUDA_VISIBLE_DEVICES in a prolog script (and
> according to what has been requested by the job).
> 
> Preventing access to the 'wrong' gpu devices by "malicious jobs" is not
> that easy. An idea could be to e.g. play with device permissions.
> 
> Cheers,
> Andreas
> 
> On Wed, 2019-08-14 at 10:21 -0400, Dj Merrill wrote:
>> To date in our HPC Grid running Son of Grid Engine 8.1.9, we've had
>> single Nvidia GPU cards per compute node.  We are contemplating the
>> purchase of a single compute node that has multiple GPU cards in it, and
>> want to ensure that running jobs only have access to the GPU resources
>> they ask for, and don't take over all of the GPU cards in the system.
>>
>> We define gpu as a resource:
>> qconf -sc:
>> #name  shortcut  type  relop  requestable  consumable  default  urgency
>> gpu    gpu       INT   <=     YES          YES         0        0
>>
>> We define GPU persistence mode and exclusive process on each node:
>> nvidia-smi -pm 1
>> nvidia-smi -c 3
>>
>> We set the number of GPUs in the host definition:
>> qconf -me (hostname)
>>
>> complex_values   gpu=1   for our existing nodes, and this setup has been
>> working fine for us.
>>
>> With the new system, we would set:
>> complex_values   gpu=4
>>
>>
>> If a job is submitted asking for one GPU, will it be limited to only
>> having access to a single GPU card on the system, or can it detect the
>> other cards and take up all four (and if so how do we prevent that)?
>>
>> Is there something like "cgroups" for gpus?
>>
>> Thanks,
>>
>> -Dj
>>
>>
>> ___
>> users mailing list
>> users@gridengine.org
>> https://gridengine.org/mailman/listinfo/users
>>
>> ___
>> users mailing list
>> users@gridengine.org
>> https://gridengine.org/mailman/listinfo/users



Re: [gridengine users] Multi-GPU setup

2019-08-14 Thread Nicolas FOURNIALS

Hi,

On 14/08/2019 at 16:35, Andreas Haupt wrote:

Preventing access to the 'wrong' gpu devices by "malicious jobs" is not
that easy. An idea could be to e.g. play with device permissions.


That's what we do by having /dev/nvidia[0-n] files owned by root and 
with permissions 660.
Prolog (executed as root) changes the file owner to give it to the user 
running the job. Epilog gives the file back to root.

It works fine for us.
If we had not been able to run the prolog/epilog as root, we could instead 
have written a small script that handles the ownership change and run it 
with sudo from the prolog. The protection would have been less effective (a 
malicious user could find the script and call it themselves), but it would 
still prevent any accidental use of a GPU not assigned to the job.
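
A condensed sketch of what Nicolas describes (not his actual scripts). In
queue_conf(5) the prolog/epilog can be declared to run as root and be passed
the job owner, e.g. "prolog  root@/path/gpu_prolog.sh $job_owner"; $gpu_id
below stands for however the site tracks which GPU was assigned to the job:

    # gpu_prolog.sh, run as root; $1 is the job owner
    chown "$1" "/dev/nvidia${gpu_id}"
    chmod 0660 "/dev/nvidia${gpu_id}"

    # gpu_epilog.sh, run as root: give the device back
    chown root "/dev/nvidia${gpu_id}"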


Regards,

--
Nicolas Fournials
System administrator
CC-IN2P3/CNRS


Re: [gridengine users] Multi-GPU setup

2019-08-14 Thread Skylar Thompson
Hi DJ,

I'm not sure if SoGE supports it, but UGE has the concept of "resource
maps" (aka RSMAP) complexes which we use to assign specific hardware
resources to specific jobs. It functions sort of as a hybrid array/scalar
consumable.

It looks like this in the host complex_values configuration:

cuda=4(0-3)

Which gives four CUDA-capable devices, with IDs 0-3. UGE sets SGE_HGR_cuda in
the job environment to the ID(s) of the device(s) assigned to the job:

$ echo "${SGE_HGR_cuda}"
0

When you look at it as a consumable, it is just an integer value, though:

n030   lx-amd64   24    2   12   24  0.01   757.0G   11.4G    8.0G   48.0K
    Host Resource(s):  hc:cuda=3.00

Which shows three of the four GPU devices are available for use.
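
A sketch of how a job script might consume the granted ID(s) (the mapping to
CUDA_VISIBLE_DEVICES is my addition, and I'm assuming multiple granted IDs
arrive space-separated in SGE_HGR_cuda):

    #!/bin/bash
    #$ -l cuda=1
    CUDA_VISIBLE_DEVICES="${SGE_HGR_cuda// /,}"   # turn "0 1" into "0,1"
    export CUDA_VISIBLE_DEVICES
    ./my_gpu_program                              # hypothetical job binary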

On Wed, Aug 14, 2019 at 10:21:12AM -0400, Dj Merrill wrote:
> To date in our HPC Grid running Son of Grid Engine 8.1.9, we've had
> single Nvidia GPU cards per compute node.  We are contemplating the
> purchase of a single compute node that has multiple GPU cards in it, and
> want to ensure that running jobs only have access to the GPU resources
> they ask for, and don't take over all of the GPU cards in the system.
> 
> We define gpu as a resource:
> qconf -sc:
> #name  shortcut  type  relop  requestable  consumable  default  urgency
> gpu    gpu       INT   <=     YES          YES         0        0
> 
> We define GPU persistence mode and exclusive process on each node:
> nvidia-smi -pm 1
> nvidia-smi -c 3
> 
> We set the number of GPUs in the host definition:
> qconf -me (hostname)
> 
> complex_values   gpu=1   for our existing nodes, and this setup has been
> working fine for us.
> 
> With the new system, we would set:
> complex_values   gpu=4
> 
> 
> If a job is submitted asking for one GPU, will it be limited to only
> having access to a single GPU card on the system, or can it detect the
> other cards and take up all four (and if so how do we prevent that)?
> 
> Is there something like "cgroups" for gpus?
> 
> Thanks,
> 
> -Dj
> 
> 
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users

-- 
-- Skylar Thompson (skyl...@u.washington.edu)
-- Genome Sciences Department, System Administrator
-- Foege Building S046, (206)-685-7354
-- University of Washington School of Medicine


Re: [gridengine users] Multi-GPU setup

2019-08-14 Thread Andreas Haupt
Hi Dj,

we do this by setting $CUDA_VISIBLE_DEVICES in a prolog script (and
according to what has been requested by the job).

Preventing access to the 'wrong' gpu devices by "malicious jobs" is not
that easy. An idea could be to e.g. play with device permissions.
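
For what it's worth, a sketch of how a prolog can hand a variable to the job
(this is my understanding of the usual trick, e.g. as used by the
sge-gpuprolog scripts mentioned elsewhere in this thread -- treat the details
as an assumption; $device_ids stands for whichever GPU(s) the prolog picked):

    # prolog fragment: append the variable to the job's environment file in
    # its spool directory, so the shepherd exports it into the job
    echo "CUDA_VISIBLE_DEVICES=${device_ids}" >> "$SGE_JOB_SPOOL_DIR/environment"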

Cheers,
Andreas

On Wed, 2019-08-14 at 10:21 -0400, Dj Merrill wrote:
> To date in our HPC Grid running Son of Grid Engine 8.1.9, we've had
> single Nvidia GPU cards per compute node.  We are contemplating the
> purchase of a single compute node that has multiple GPU cards in it, and
> want to ensure that running jobs only have access to the GPU resources
> they ask for, and don't take over all of the GPU cards in the system.
> 
> We define gpu as a resource:
> qconf -sc:
> #name  shortcut  type  relop  requestable  consumable  default  urgency
> gpu    gpu       INT   <=     YES          YES         0        0
> 
> We define GPU persistence mode and exclusive process on each node:
> nvidia-smi -pm 1
> nvidia-smi -c 3
> 
> We set the number of GPUs in the host definition:
> qconf -me (hostname)
> 
> complex_values   gpu=1   for our existing nodes, and this setup has been
> working fine for us.
> 
> With the new system, we would set:
> complex_values   gpu=4
> 
> 
> If a job is submitted asking for one GPU, will it be limited to only
> having access to a single GPU card on the system, or can it detect the
> other cards and take up all four (and if so how do we prevent that)?
> 
> Is there something like "cgroups" for gpus?
> 
> Thanks,
> 
> -Dj
> 
> 
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
-- 
| Andreas Haupt    | E-Mail: andreas.ha...@desy.de
| DESY Zeuthen     | WWW:    http://www-zeuthen.desy.de/~ahaupt
| Platanenallee 6  | Phone:  +49/33762/7-7359
| D-15738 Zeuthen  | Fax:    +49/33762/7-7216


