Hi Semi,

The GPU load sensor is for monitoring the health of GPU devices, and again it
is very similar to the product from Bright - but note that Bright has a very
nice GUI (and since we are not planning to compete against Bright, the Open
Grid Scheduler project will most likely not implement a GUI front-end for our
GPU load sensor):
http://www.brightcomputing.com/NVIDIA-GPU-Cluster-Management-Monitoring.php

Note that Ganglia also has a plugin for NVML:

https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia

We (Bright, Ganglia, and the GPU load sensor distributed by the Open Grid
Scheduler project) all use the NVML API from NVIDIA to monitor GPU health - if
you want monitoring of your GPU devices, then you can just pick one of these
solutions.

The complex setup, on the other hand, is for internal accounting inside Grid
Engine - i.e. it tells GE how many GPU cards there are and how many are in
use. The method I described is more powerful (and more complex) than the one
in the Rocks mailing list message - it models GPUs as any other consumable
resource, like software licenses, disk space, etc.

Rayson


On Sun, May 20, 2012 at 9:28 AM, Reuti <[email protected]> wrote:
> On 20.05.2012, at 13:24, Semi wrote:
>
>> Please correct me if I have understood your proposal for the definitions
>> and usage correctly:
>>
>> qconf -sc | grep gpu
>> gpu             gpu     INT   <=   YES   YES   0   0
>>
>> qconf -me sge135
>> hostname        sge135
>> load_scaling    NONE
>> complex_values  gpu=2
>>
>> qsub -l gpu=2 test.sh
>>
>> A load_sensor is not needed.
>
> Yes, this is fine and preferred IMO. The ROCKS link uses only a BOOL complex
> and puts the amount in the queue definition instead, i.e. it can't be shared
> across several queues.
>
> -- Reuti
>
>>
>> On 5/20/2012 12:44 PM, Reuti wrote:
>>> On 20.05.2012, at 10:21, Semi wrote:
>>>
>>>> Hi Rayson!
>>>>
>>>> Can I use this method for the GPU definition? It's clearer to me:
>>>>
>>>> https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2011-August/054479.html
>>>
>>> Requesting a CUDA complex and a CUDA queue looks redundant. And it's also
>>> only for a single GPU per machine AFAICS. What's wrong with Rayson's
>>> setup? It's just a different type of complex.
>>>
>>> -- Reuti
>>>
>>>> On 5/17/2012 7:09 PM, Rayson Ho wrote:
>>>>
>>>>> On Tue, May 15, 2012 at 6:11 AM, Semi <[email protected]> wrote:
>>>>>
>>>>>> Can you give me a more detailed answer and correct my definitions?
>>>>>
>>>>> Hi Semi,
>>>>>
>>>>> I was away for the past 2 days. Please always cc the list when you are
>>>>> replying (I guess Reuti, Ron, and I always suggest that people do that -
>>>>> there are many ways to configure Grid Engine, others may see something
>>>>> that we don't, and it is usually better to get feedback from more
>>>>> people).
>>>>>
>>>>> On the other hand, if you really need it, you might consider support
>>>>> (http://www.scalablelogic.com/scalable-grid-engine-support). There is
>>>>> always someone who can respond to your questions even when I am away.
>>>>>
>>>>>> qconf -sc | grep gpu
>>>>>> gpu             gpu     INT   <=   YES   YES   0   0
>>>>>>
>>>>>> qconf -me sge135
>>>>>> hostname        sge135
>>>>>> load_scaling    NONE
>>>>>> complex_values  gpu=2
>>>>>>
>>>>>> qconf -mconf sge135
>>>>>> sge135:
>>>>>> mailer          /bin/mail
>>>>>> xterm           /usr/bin/X11/xterm
>>>>>> qlogin_daemon   /usr/sbin/in.telnetd
>>>>>> rlogin_daemon   /usr/sbin/in.rlogind
>>>>>> load_sensor     /storage/SGE6U8/gpu-load-sensor/cuda_sensor
>>>>>
>>>>> Note that if you statically define a host to have 2 GPUs, then you
>>>>> don't need to use the cuda_sensor. The GPU load sensor distributed by
>>>>> the Open Grid Scheduler project (which you can also find in other Grid
>>>>> Engine implementations) is very similar to Bright Computing's GPU
>>>>> management in the Bright Cluster Manager:
>>>>>
>>>>> http://www.brightcomputing.com/NVIDIA-GPU-Cluster-Management-Monitoring.php
>>>>>
>>>>> We both monitor temperature, fan speed, voltage, ECC, etc. When we
>>>>> started the GPU load sensor development we didn't know that Bright had
>>>>> something similar...
>>>>>
>>>>> From a scheduling point of view, you can ignore most of that.
>>>>> Some sites like to bias node priority based on GPU temperature, and in
>>>>> some cases, if the ECC errors are really bad, the GPU should not be
>>>>> used for GPU jobs.
>>>>>
>>>>>> qsub -l gpu=1 test.sh
>>>>>>
>>>>>> And if I need a parallel run on the GPU, what do I have to do? How do
>>>>>> I define a PE for the GPU?
>>>>>
>>>>> You just use "qsub -l gpu=2" if you want to use 2 GPUs for that job.
>>>>>
>>>>> Rayson
>>>>>
>>>>>> On 5/14/2012 2:51 PM, Rayson Ho wrote:
>>>>>>
>>>>>> Just get the load sensor from:
>>>>>>
>>>>>> https://gridscheduler.svn.sourceforge.net/svnroot/gridscheduler/trunk/source/dist/gpu/gpu_sensor.c
>>>>>>
>>>>>> Compile it on your system - and make sure that it has the CUDA SDK &
>>>>>> libraries installed (Google is your friend - look for the nvidia-ml
>>>>>> library):
>>>>>>
>>>>>> % cc gpu_sensor.c -lnvidia-ml
>>>>>>
>>>>>> Before you use it as a load sensor, compile and run it interactively:
>>>>>>
>>>>>> % cc gpu_sensor.c -DSTANDALONE -lnvidia-ml
>>>>>>
>>>>>> Make sure that the code is reporting something meaningful on your
>>>>>> system.
>>>>>>
>>>>>> Rayson
>>>>>>
>>>>>> On Mon, May 14, 2012 at 4:55 AM, Semi <[email protected]> wrote:
>>>>>>
>>>>>> Please help with GPU integration under SGE and with running NAMD and
>>>>>> GAMESS in parallel on GPUs via SGE.
>>>>>>
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> [email protected]
>>>>>> https://gridengine.org/mailman/listinfo/users
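[For readers following this thread: a load sensor like gpu_sensor.c speaks a
simple text protocol to sge_execd - the sensor blocks on stdin, and each time
execd writes a newline it answers with a report wrapped in "begin"/"end"
lines, one "host:complex:value" line per value; on "quit" it exits. Below is
a minimal C sketch of that loop, not the actual gpu_sensor.c. The host name
"node01" and the fixed count of 2 GPUs are hypothetical placeholders - the
real sensor queries NVML (e.g. nvmlDeviceGetCount) for the values instead.]

```c
#include <stdio.h>
#include <string.h>

/*
 * Format one SGE load sensor report into buf:
 *
 *   begin
 *   <host>:<complex>:<value>
 *   end
 *
 * A real GPU sensor would obtain ngpus from NVML; here the caller
 * supplies a placeholder value.
 */
char *format_report(char *buf, size_t len, const char *host, int ngpus)
{
    snprintf(buf, len, "begin\n%s:gpu:%d\nend\n", host, ngpus);
    return buf;
}

/*
 * Load sensor protocol loop: sge_execd writes a newline on every load
 * report interval; we answer each one with a report, and stop when we
 * read "quit" (or hit EOF).  A real sensor's main() would call
 * run_sensor(stdin, stdout, <local hostname>, <NVML device count>).
 */
void run_sensor(FILE *in, FILE *out, const char *host, int ngpus)
{
    char line[256], report[256];

    while (fgets(line, sizeof line, in) != NULL) {
        if (strncmp(line, "quit", 4) == 0)
            break;
        fputs(format_report(report, sizeof report, host, ngpus), out);
        fflush(out);  /* execd parses the report immediately */
    }
}
```

This mirrors why the thread says a load sensor is unnecessary for plain
accounting: the begin/end report only refreshes a load value, while a static
`complex_values gpu=2` entry already gives the scheduler a consumable to
count against.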
