I'm trying to implement the instructions given here
http://developer.nvidia.com/ganglia-monitoring-system
on one of our Rocks 5.4.2 clusters that has 2 GPU cards
in every compute node.
For completeness, I should mention that
I already have this working on two of our standalone
workstations running RHEL 6.x
http://dirac.dcs.it.mtu.edu/ganglia/
and the steps I followed are documented here:
http://sgowtham.net/blog/2012/02/11/ganglia-gmond-python-module-for-monitoring-nvidia-gpu/
Part #1: Python bindings for the NVML
http://pypi.python.org/pypi/nvidia-ml-py/
This requires a Python newer than 2.4. Following
Phil Papadopoulos' instructions in a recent email on
the Rocks mailing list, I installed Python 2.7 and 3.x
and used Python 2.7 to install these Python bindings
for NVML.
These are the commands I ran on the front end as well
as on the compute nodes:
cd /share/apps/tmp/
wget http://pypi.python.org/packages/source/n/nvidia-ml-py/nvidia-ml-py-2.285.01.tar.gz
cd /tmp/
tar -zxvf /share/apps/tmp/nvidia-ml-py-2.285.01.tar.gz
cd nvidia-ml-py-2.285.01
/opt/python/bin/python2.7 setup.py install
The process completes with no errors and prints this output:
running install
running build
running build_py
running install_lib
running install_egg_info
Writing /opt/python/lib/python2.7/site-packages/nvidia_ml_py-2.285.01-py2.7.egg-info
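A quick sanity check that could follow the install, sketched here
rather than taken from the steps above: ask the same interpreter to
import the module. The helper name can_import is hypothetical; pynvml
is the module file shipped in the tarball (the pynvml.py that is
copied further down).

```shell
# Hypothetical helper (not part of nvidia-ml-py or Rocks):
# report whether interpreter $1 can import module $2.
can_import() {
    if "$1" -c "import $2" >/dev/null 2>&1; then
        echo "$1: $2 OK"
    else
        echo "$1: $2 NOT importable"
        return 1
    fi
}

# Example call (interpreter path taken from the install command above):
#   can_import /opt/python/bin/python2.7 pynvml
```

Running the same check against the system Python would show whether
the bindings are visible only to the 2.7 install.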
Part #2: Ganglia/gmond python modules & web patch
I downloaded
ganglia-gmond_python_modules-3dfa553.tar.gz
from
https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia
to /share/apps/tmp/ and then ran the following commands
on the front end:
cd /tmp/
cp nvidia-ml-py-2.285.01/nvidia_smi.py /opt/ganglia/lib64/ganglia/python_modules/
cp nvidia-ml-py-2.285.01/pynvml.py /opt/ganglia/lib64/ganglia/python_modules/
tar -zxvf /share/apps/tmp/ganglia-gmond_python_modules-3dfa553.tar.gz
cd ganglia-gmond_python_modules-3dfa553
cp python_modules/nvidia.py /opt/ganglia/lib64/ganglia/python_modules/
cp conf.d/nvidia.pyconf /etc/ganglia/conf.d/
cp conf.d/nvidia.pyconf /opt/ganglia/etc/conf.d/
cp graph.d/*.php /var/www/html/ganglia/graph.d/
cd /var/www/html/ganglia/
patch -p0 < /tmp/ganglia-gmond_python_modules-3dfa553/gpu/nvidia/ganglia_web.patch
/etc/init.d/gmetad restart
/etc/init.d/gmond restart
Then on the compute node, I did the following:
cd /tmp/
cp nvidia-ml-py-2.285.01/nvidia_smi.py /opt/ganglia/lib64/ganglia/python_modu$
cp nvidia-ml-py-2.285.01/pynvml.py /opt/ganglia/lib64/ganglia/python_modules/
tar -zxvf /share/apps/tmp/ganglia-gmond_python_modules-3dfa553.tar.gz
cd ganglia-gmond_python_modules-3dfa553
cp python_modules/nvidia.py /opt/ganglia/lib64/ganglia/python_modules/
cp conf.d/nvidia.pyconf /etc/ganglia/conf.d/
cp conf.d/nvidia.pyconf /opt/ganglia/etc/conf.d/
/etc/init.d/gmond restart
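To tell whether gmond on the node is publishing any GPU metrics at
all (as opposed to the web front end not displaying them), one could
inspect the XML that gmond serves. The sketch below assumes gmond
listens on its default TCP port 8649, that nc is installed, and that
the metric names defined by nvidia.py start with "gpu";
count_gpu_metrics is a hypothetical helper name.

```shell
# Hypothetical helper: read gmond XML on stdin and print how many lines
# contain a METRIC element whose name begins with "gpu" (assumed prefix).
count_gpu_metrics() {
    # grep -c exits non-zero when the count is 0; "|| true" hides that.
    grep -c '<METRIC NAME="gpu' || true
}

# Example call on a compute node (port 8649 is assumed, see above):
#   nc localhost 8649 | count_gpu_metrics
```

A count of 0 would suggest that gmond is not loading the module on
that node; a non-zero count would point at the web side instead.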
When I point the browser to the cluster's Ganglia page and
click on 'compute-0-0', no GPU metrics show up.
What am I doing wrong? Did I miss something simple or
important? Does this have anything to do with the
fact that most Rocks utilities are built with
Python 2.4 while these new bindings were installed
with Python 2.7?
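One way to look into the Python version question, offered as a
sketch: gmond runs Python modules through its own embedded
interpreter, so what matters is the libpython that its Python support
module is linked against, not which interpreter ran setup.py. The
path to modpython.so below is a guess based on the python_modules
directory used above, and linked_python is a hypothetical helper name.

```shell
# Hypothetical helper: read `ldd` output on stdin and print the libpython
# version found on the first matching line (e.g. 2.4), or nothing if none.
linked_python() {
    sed -n 's/.*libpython\([0-9][0-9.]*[0-9]\)\.so.*/\1/p' | head -n 1
}

# Example call (the modpython.so location is an assumption):
#   ldd /opt/ganglia/lib64/ganglia/modpython.so | linked_python
```

If this prints 2.4, pynvml (which, as noted above, needs a Python
newer than 2.4) would presumably fail to import inside gmond
regardless of the 2.7 install.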
If any of you have tried this on your cluster and
got it to work, I'd greatly appreciate some direction.
Thanks for your time and help.
Best,
g
--
Gowtham
Information Technology Services
Michigan Technological University
(906) 487/3593
http://www.it.mtu.edu/