Hi Semi,

The GPU load sensor is for monitoring the health of GPU devices, and
again it is very similar to the product from Bright - but note that
Bright has a very nice GUI (and since we are not planning to compete
against Bright, the Open Grid Scheduler project will most likely not
implement a GUI front-end for our GPU load sensor).

http://www.brightcomputing.com/NVIDIA-GPU-Cluster-Management-Monitoring.php


And note that Ganglia also has a plugin for NVML:

https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia


We (Bright, Ganglia, and the GPU load sensor distributed by the Open
Grid Scheduler project) all use the NVML API from NVIDIA to monitor
GPU health - if you want to monitor your GPU devices, you can just
pick one of these solutions.
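For context, a Grid Engine load sensor is just a program in a loop on
stdin/stdout: per the load sensor protocol in sge_conf(5), it waits for a
line on stdin, exits on "quit", and otherwise answers with a block of
host:name:value lines delimited by "begin" and "end". A minimal Python
sketch of that protocol (the gpu_temp/gpu_fan values here are made up -
a real sensor like gpu_sensor.c would query NVML instead):

```python
import socket
import sys

def format_report(host, values):
    """Build one load report in the begin / host:name:value / end
    format that sge_execd expects from a load sensor."""
    lines = ["begin"]
    for name, value in values.items():
        lines.append("%s:%s:%s" % (host, name, value))
    lines.append("end")
    return "\n".join(lines)

def run_sensor():
    # sge_execd writes a line each load interval to request a report,
    # and "quit" to shut the sensor down.
    host = socket.gethostname()
    for line in sys.stdin:
        if line.strip() == "quit":
            break
        # Hypothetical values -- substitute real NVML queries here.
        print(format_report(host, {"gpu_temp": 65, "gpu_fan": 40}))
        sys.stdout.flush()

if __name__ == "__main__":
    run_sensor()
```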

And the complex setup is for internal accounting inside Grid Engine -
i.e. it tells GE how many GPU cards there are and how many are in use.
The method described in the Rocks mailing list message is more
powerful and more complex - the method I described models GPUs like
any other consumable resource, such as software licenses, disk space, etc.
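The consumable bookkeeping is simple to picture: each host advertises a
capacity (complex_values gpu=2), the scheduler decrements it for every
running job that requests gpu, and a job asking for more than what is
left waits. A toy Python model of that accounting (the class and method
names are illustrative, not GE internals):

```python
class ConsumableHost:
    """Toy model of Grid Engine's consumable accounting on one host."""

    def __init__(self, capacity):
        self.capacity = capacity   # like complex_values gpu=<capacity>
        self.in_use = 0

    def available(self):
        return self.capacity - self.in_use

    def start_job(self, requested):
        """Like 'qsub -l gpu=<requested>': runs only if enough are free."""
        if requested > self.available():
            return False           # job stays queued
        self.in_use += requested
        return True

    def finish_job(self, requested):
        self.in_use -= requested   # resources returned when the job ends

host = ConsumableHost(capacity=2)  # qconf -me: complex_values gpu=2
assert host.start_job(1)           # qsub -l gpu=1 -> scheduled
assert not host.start_job(2)       # qsub -l gpu=2 -> waits, only 1 free
host.finish_job(1)
assert host.start_job(2)           # both GPUs free again -> scheduled
```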

Rayson



On Sun, May 20, 2012 at 9:28 AM, Reuti <[email protected]> wrote:
> Am 20.05.2012 um 13:24 schrieb Semi:
>
>> Please correct me if I've understood your proposal for definitions and 
>> usage correctly:
>> qconf -sc|grep gpu
>> gpu                 gpu          INT         <=    YES         YES 0        0
>>
>> qconf -me sge135
>> hostname              sge135
>> load_scaling          NONE
>> complex_values        gpu=2
>>
>> qsub -l gpu=2 test.sh
>>
>> load_sensor is not needed.
>
> Yes, this is fine and preferred IMO. The ROCKS link uses only a BOOL complex 
> and puts the amount in the queue definition instead, i.e. it can't be shared 
> across several queues.
>
> -- Reuti
>
>
>>
>> On 5/20/2012 12:44 PM, Reuti wrote:
>>> Am 20.05.2012 um 10:21 schrieb Semi:
>>>
>>>
>>>> Hi Rayson!
>>>>
>>>> Can I use this method for the GPU definition?  It's clearer to me.
>>>>
>>>>
>>>> https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2011-August/054479.html
>>> Requesting a CUDA complex and a CUDA queue looks redundant. And it's also 
>>> only for a single GPU per machine, AFAICS. What's wrong with Rayson's setup? 
>>> It's just a different type of complex.
>>>
>>> -- Reuti
>>>
>>>
>>>
>>>> On 5/17/2012 7:09 PM, Rayson Ho wrote:
>>>>
>>>>> On Tue, May 15, 2012 at 6:11 AM, Semi<[email protected]>
>>>>>   wrote:
>>>>>
>>>>>> Can you give me more detailed answer and correct my definitions.
>>>>>>
>>>>> Hi Semi,
>>>>>
>>>>> I was away for the past 2 days. Please always cc the list when you are
>>>>> replying (I guess Reuti, Ron, and I always suggest that people do that -
>>>>> there are many ways to configure Grid Engine, others may see
>>>>> something that we don't, and it is usually better to get feedback
>>>>> from more people).
>>>>>
>>>>> On the other hand, if you really need it, you might consider support
>>>>> (http://www.scalablelogic.com/scalable-grid-engine-support). There is
>>>>> always someone who can respond to your questions even when I am away.
>>>>>
>>>>>
>>>>>
>>>>>> qconf -sc|grep gpu
>>>>>> gpu                 gpu          INT         <=    YES         YES 0        0
>>>>>>
>>>>>> qconf -me sge135
>>>>>> hostname              sge135
>>>>>> load_scaling          NONE
>>>>>> complex_values        gpu=2
>>>>>>
>>>>>> qconf -mconf sge135
>>>>>> sge135:
>>>>>> mailer                       /bin/mail
>>>>>> xterm                        /usr/bin/X11/xterm
>>>>>> qlogin_daemon                /usr/sbin/in.telnetd
>>>>>> rlogin_daemon                /usr/sbin/in.rlogind
>>>>>> load_sensor                  /storage/SGE6U8/gpu-load-sensor/cuda_sensor
>>>>>>
>>>>> Note that if you statically define a host to have 2 GPUs, then you
>>>>> don't need to use the cuda_sensor. The GPU load sensor distributed by
>>>>> the Open Grid Scheduler project (which you can find in other Grid
>>>>> Engine implementations) is very similar to Bright Computing's GPU
>>>>> Management in the Bright Cluster Manager:
>>>>>
>>>>>
>>>>> http://www.brightcomputing.com/NVIDIA-GPU-Cluster-Management-Monitoring.php
>>>>>
>>>>>
>>>>> We both monitor temperature, fan speed, voltage, ECC, etc. When we
>>>>> started the GPU load sensor development we didn't know that Bright had
>>>>> something similar...
>>>>>
>>>>> From a scheduling point of view, you can ignore most of that. Some
>>>>> sites like to bias node priority based on GPU temperature, and in some
>>>>> cases, if the ECC error rate is really bad, the GPU should not be used
>>>>> for GPU jobs.
>>>>>
>>>>>
>>>>>
>>>>>> qsub -l gpu=1 test.sh
>>>>>>
>>>>>> And if I need a parallel run on GPUs, what do I have to do? How do I
>>>>>> define a PE for the GPUs?
>>>>>>
>>>>> You just use "qsub -l gpu=2" if you want to use 2 GPUs for that job.
>>>>>
>>>>> Rayson
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> On 5/14/2012 2:51 PM, Rayson Ho wrote:
>>>>>>
>>>>>> Just get the load sensor from:
>>>>>>
>>>>>>
>>>>>> https://gridscheduler.svn.sourceforge.net/svnroot/gridscheduler/trunk/source/dist/gpu/gpu_sensor.c
>>>>>>
>>>>>>
>>>>>> Compile it on your system - and make sure that it has the CUDA SDK &
>>>>>> libraries installed (Google is your friend - look for the nvidia-ml
>>>>>> library).
>>>>>>
>>>>>> % cc gpu_sensor.c -lnvidia-ml
>>>>>>
>>>>>> Before you use it as a load sensor, compile and run it interactively:
>>>>>>
>>>>>> % cc gpu_sensor.c -DSTANDALONE -lnvidia-ml
>>>>>>
>>>>>> Make sure that the code is reporting something meaningful on your system.
>>>>>>
>>>>>> Rayson
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, May 14, 2012 at 4:55 AM, Semi
>>>>>> <[email protected]>
>>>>>>   wrote:
>>>>>>
>>>>>> Please help with GPU integration under SGE and with parallel runs of
>>>>>> NAMD and GAMESS on GPUs via SGE.
>>>>>>
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>>
>>>>>> [email protected]
>>>>>> https://gridengine.org/mailman/listinfo/users

