And here is some more info:

http://serverfault.com/questions/322073/howto-set-up-sge-for-cuda-devices
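
The short version of what those links describe: define the GPU count as a
consumable complex and attach a per-host capacity. A minimal sketch (the
complex name "gpu" is my own choice, adjust to taste):

  # qconf -mc  -- add one line to the complex definition:
  # name  shortcut  type  relop  requestable  consumable  default  urgency
  gpu     gpu       INT   <=     YES          YES         0        0

  # qconf -me node1  -- advertise the host's capacity:
  complex_values    gpu=4

Jobs then request devices with something like "qsub -l gpu=1 job.sh".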

On Mon, Apr 14, 2014 at 1:39 PM, Ian Kaufman <ikauf...@eng.ucsd.edu> wrote:
> If everything is configured correctly, GridEngine will be aware that
> the GPU in node1 is in use and will schedule around it, ensuring that
> the 8-GPU job gets unused GPUs.
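>
> One subtlety worth noting: with a per-slot consumable (consumable=YES),
> a parallel job's "-l gpu=..." request is multiplied by the slots granted
> on each host. Roughly (the PE name "mpi" is made up):
>
>   # serial job: consumes 1 of node1's gpu=4
>   qsub -l gpu=1 serial_job.sh
>
>   # PE job: 8 slots each requesting gpu=1 consume 8 GPUs cluster-wide,
>   # which the scheduler can only grant from hosts with free capacity
>   qsub -pe mpi 8 -l gpu=1 parallel_job.sh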
>
> Ian
>
> On Mon, Apr 14, 2014 at 1:38 PM, Ian Kaufman <ikauf...@eng.ucsd.edu> wrote:
>> Look at the info presented here:
>>
>> http://stackoverflow.com/questions/10557816/scheduling-gpu-resources-using-the-sun-grid-engine-sge
>>
>> Ian
>>
>> On Mon, Apr 14, 2014 at 1:29 PM, Feng Zhang <prod.f...@gmail.com> wrote:
>>> Thanks, Ian and Gowtham!
>>>
>>>
>>> These are very nice instructions. Here is one of my problems, for example:
>>>
>>> node1:  4 GPUs
>>> node2:  4 GPUs
>>> node3:  2 GPUs
>>>
>>> So in total I have 10 GPUs.
>>>
>>> Right now, user A has a serial GPU job that takes one GPU on node1
>>> (I don't know which GPU, though). So node1 still has 3, node2 has 4,
>>> and node3 has 2 GPUs free for jobs.
>>>
>>> Now I submit one job with PE=8. SGE allocates all 3 nodes to me, with
>>> 8 GPU slots in total. The problem is: how does my job know which GPUs
>>> it can use on node1?
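>>>
>>> Right now I can only script around it, for example with a
>>> lock-directory scheme like the sketch below (everything here, the
>>> paths, NGPUS, and the fixed device list, is made up; SGE itself
>>> provides none of it):
>>>
>>>   #!/bin/sh
>>>   # Claim $NGPUS free devices via lock directories; mkdir is atomic,
>>>   # so concurrent jobs on one node cannot grab the same GPU.
>>>   NGPUS=${NGPUS:-1}
>>>   LOCKROOT=/tmp/gpu_locks       # assumed per-node lock directory
>>>   mkdir -p "$LOCKROOT"
>>>   granted=""; n=0
>>>   for dev in 0 1 2 3; do
>>>       if mkdir "$LOCKROOT/$dev" 2>/dev/null; then
>>>           granted="$granted${granted:+,}$dev"
>>>           n=$((n+1))
>>>           [ "$n" -eq "$NGPUS" ] && break
>>>       fi
>>>   done
>>>   export CUDA_VISIBLE_DEVICES="$granted"
>>>   # ... run the CUDA program, then rmdir each claimed lock in an
>>>   # epilog or trap so the devices are freed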
>>>
>>> Best
>>>
>>> On Mon, Apr 14, 2014 at 4:13 PM, Ian Kaufman <ikauf...@eng.ucsd.edu> wrote:
>>>> Again, look into using the GPU as a consumable resource, as Gowtham posted above.
>>>>
>>>> Ian
>>>>
>>>> On Mon, Apr 14, 2014 at 11:57 AM, Feng Zhang <prod.f...@gmail.com> wrote:
>>>>> Thanks, Reuti,
>>>>>
>>>>> The socket solution looks like it only works well for serial jobs,
>>>>> not PE jobs, right?
>>>>>
>>>>> Our cluster has different kinds of nodes: some have 2 GPUs each,
>>>>> others have 4. Most of the user jobs are PE jobs; some are serial.
>>>>>
>>>>> The socket solution might even work for PE jobs, but as I understand
>>>>> it, it is not efficient: each node would have, for example, 4 queues,
>>>>> and if a user submits a PE job to one queue, it cannot use the GPUs
>>>>> attached to the other queues, right?
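>>>>>
>>>>> For example (queue names and the PE name are made up): with one
>>>>> single-slot queue per GPU, say gpu0.q .. gpu3.q on a 4-GPU node:
>>>>>
>>>>>   # a PE job sent to one queue is confined to that queue's slots
>>>>>   qsub -pe mpi 8 -q gpu0.q job.sh
>>>>>   # it can only span the per-GPU queues with a wildcard request
>>>>>   qsub -pe mpi 8 -q 'gpu*.q' job.sh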
>>>>>
>>>>> On Mon, Apr 14, 2014 at 2:16 PM, Reuti <re...@staff.uni-marburg.de> wrote:
>>>>>> On 14.04.2014 at 20:06, Feng Zhang wrote:
>>>>>>
>>>>>>> Thanks, Ian!
>>>>>>>
>>>>>>> I haven't checked the GPU load sensor in detail either. It sounds
>>>>>>> to me like it only tracks the number of GPUs allocated to a job;
>>>>>>> the job itself doesn't know which GPUs it actually got, so it
>>>>>>> cannot set CUDA_VISIBLE_DEVICES (some programs need this
>>>>>>> environment variable to be set). This can be done by writing some
>>>>>>> scripts/programs, but to me it is not a robust solution, since
>>>>>>> jobs may still happen to collide with each other on the same GPU
>>>>>>> of a multi-GPU node. If GE had the memory to record which GPUs are
>>>>>>> allocated to each job, that would be perfect.
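>>>>>>>
>>>>>>> (For context, CUDA_VISIBLE_DEVICES is how a job gets pinned to
>>>>>>> specific devices, e.g.:
>>>>>>>
>>>>>>>   # expose only physical GPUs 1 and 3 to the program; inside the
>>>>>>>   # job they are renumbered as devices 0 and 1
>>>>>>>   export CUDA_VISIBLE_DEVICES=1,3
>>>>>>>   ./my_cuda_program       # program name is a placeholder
>>>>>>>
>>>>>>> so whoever hands out the GPUs has to set it per job.)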
>>>>>>
>>>>>> Like the option to request sockets instead of cores, which I posted
>>>>>> in the last couple of days, you can use a similar approach to get
>>>>>> the number of the granted GPU out of the queue name.
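>>>>>>
>>>>>> Something along these lines in the job script would do it (the queue
>>>>>> names gpu0.q, gpu1.q, ... are an assumption; $QUEUE is set by SGE):
>>>>>>
>>>>>>   # strip "gpu" and ".q" from the queue name to get the device id
>>>>>>   GPU_ID=$(expr "$QUEUE" : 'gpu\([0-9]*\)\.q')
>>>>>>   export CUDA_VISIBLE_DEVICES=$GPU_ID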
>>>>>>
>>>>>> -- Reuti
>>>>>>
>>>>>>
>>>>>>> On Mon, Apr 14, 2014 at 1:46 PM, Ian Kaufman <ikauf...@eng.ucsd.edu> 
>>>>>>> wrote:
>>>>>>>> I believe there is already support for GPUs: there is a GPU load
>>>>>>>> sensor in Open Grid Engine. You may have to build it yourself; I
>>>>>>>> haven't checked whether it comes pre-packaged.
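>>>>>>>>
>>>>>>>> A load sensor is just a script speaking the begin/end protocol, so
>>>>>>>> a minimal GPU one could look like the sketch below (the complex
>>>>>>>> name gpu.free is an assumption, and counting every GPU as free is
>>>>>>>> a placeholder for real accounting):
>>>>>>>>
>>>>>>>>   #!/bin/sh
>>>>>>>>   # qmaster writes a line to stdin each interval; "quit" stops us
>>>>>>>>   HOST=$(hostname)
>>>>>>>>   while read -r line; do
>>>>>>>>       [ "$line" = "quit" ] && exit 0
>>>>>>>>       FREE=$(nvidia-smi --list-gpus | wc -l)
>>>>>>>>       echo "begin"
>>>>>>>>       echo "$HOST:gpu.free:$FREE"
>>>>>>>>       echo "end"
>>>>>>>>   done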
>>>>>>>>
>>>>>>>> Univa has Phi support, and I believe OGE/OGS has it as well, or at
>>>>>>>> least has been working on it.
>>>>>>>>
>>>>>>>> Ian
>>>>>>>>
>>>>>>>> On Mon, Apr 14, 2014 at 10:35 AM, Feng Zhang <prod.f...@gmail.com> 
>>>>>>>> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Is there any plan to implement GPU resource management in SGE in
>>>>>>>>> the near future, like Slurm or Torque have? There are some ways
>>>>>>>>> to do this using scripts/programs, but I wonder whether SGE
>>>>>>>>> itself can recognize and manage GPUs (and Phis). It doesn't need
>>>>>>>>> to be complicated or powerful, just do the basic work.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>
>>
>



-- 
Ian Kaufman
Research Systems Administrator
UC San Diego, Jacobs School of Engineering ikaufman AT ucsd DOT edu
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
