Re: [gridengine users] Monitor GPU memory usage per job

Dave Love Thu, 25 Apr 2013 09:59:35 -0700

Stephen Willey <[email protected]> writes:

> You could use a load sensor to do this.  We use one to detect if
> people are logged in and suspend/requeue the jobs if someone logs in
> while a job's on their workstation.


I don't see how that addresses the question, as I understand it.

> http://arc.liv.ac.uk/SGE/howto/loadsensor.html shows you how to make
> one, then you'd set your queue to have a load/suspend threshold set at
> whatever you'd like (configurable per queue instance or
> host/hostgroup).

But that gives a host-level value; it's not specific to a job/task.
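
To illustrate the gap, here is a rough sketch of per-job accounting, not
something Grid Engine does itself.  It assumes an nvidia-smi recent
enough to support
"nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader,nounits",
that every row carries a numeric value, and that a job's processes still
have JOB_ID in their environment (readable only by the owner or root).
The function names are made up for the example.

```python
import csv
from collections import defaultdict


def parse_compute_apps(csv_text):
    """Parse 'pid, used_memory' rows into (pid, MiB) pairs."""
    rows = csv.reader(csv_text.splitlines(), skipinitialspace=True)
    return [(int(pid), int(mib)) for pid, mib in (r for r in rows if r)]


def job_id_from_environ(pid):
    """Best-effort lookup of JOB_ID in /proc/<pid>/environ (Linux only)."""
    try:
        with open("/proc/{}/environ".format(pid), "rb") as f:
            for entry in f.read().split(b"\0"):
                if entry.startswith(b"JOB_ID="):
                    return entry[len(b"JOB_ID="):].decode()
    except OSError:
        pass  # process gone, or not readable by this user
    return None


def usage_by_job(csv_text, job_of=job_id_from_environ):
    """Sum GPU memory (MiB) per job; unattributable PIDs go under None."""
    totals = defaultdict(int)
    for pid, mib in parse_compute_apps(csv_text):
        totals[job_of(pid)] += mib
    return dict(totals)
```

Feeding usage_by_job() the nvidia-smi output gives a dictionary of job
id to MiB.  A process that has rewritten its environment, or a card that
reports no per-process figure, would defeat this.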

> You'd probably use nvidia-smi (assuming you're on Linux) to get the
> card details out and parse them to form the load figure.

What's missing from
<http://arc.liv.ac.uk/repos/darcs/sge/source/dist/util/resources/loadsensors/gpu-loadsensor.c>?
It reports what seemed to be all the useful information that I could see
how to extract from CUDA and OpenCL devices with the then-current
support.  Enhancements would be welcome, but I'm not convinced it's
terribly useful.
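
For comparison, the nvidia-smi approach suggested above could be
sketched roughly as below.  This is not how gpu-loadsensor.c works; it
is a minimal illustration of the load sensor protocol from the howto
(the execd sends a line per load interval, the sensor answers with a
begin/end block, and "quit" ends it).  It assumes an nvidia-smi that
supports "--query-gpu=memory.used --format=csv,noheader,nounits" and
returns plain numbers; "gpu_mem_used" is a hypothetical complex name
that would have to be defined on the cluster.

```python
import socket
import subprocess


def parse_used_mib(csv_text):
    """Sum the per-GPU 'memory.used' values (MiB), one number per line."""
    return sum(int(line) for line in csv_text.splitlines() if line.strip())


def query_nvidia_smi():
    """Ask nvidia-smi for used memory per GPU, as bare MiB values."""
    return subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        text=True,
    )


def format_report(host, used_mib):
    """One load report in the begin/end framing the execd expects."""
    # "gpu_mem_used" is a hypothetical complex name, not a stock one.
    return "begin\n{}:gpu_mem_used:{}\nend\n".format(host, used_mib)


def run_sensor(inp, out, query=query_nvidia_smi, host=None):
    """Emit one report per line received; stop on 'quit' or EOF."""
    host = host or socket.gethostname()
    for line in inp:
        if line.strip() == "quit":
            break
        out.write(format_report(host, parse_used_mib(query())))
        out.flush()
```

To run it as a sensor, call run_sensor(sys.stdin, sys.stdout) at the end
of the script.  The figure is still a whole-host total, so it says
nothing about which job is using the memory.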

> There are a
> few more related details here:
> http://serverfault.com/questions/322073/howto-set-up-sge-for-cuda-devices

For example, I don't think that deals with the way GPUs are supposed to
be used here, with mixed graphics/computation and shared/exclusive
access.

-- 
Community Grid Engine:  http://arc.liv.ac.uk/SGE/
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
