I'm trying to implement the instructions given here

 http://developer.nvidia.com/ganglia-monitoring-system

on one of our Rocks 5.4.2 clusters that has 2 GPU cards
in every compute node.


For completeness, I should mention that I already have
this working on two of our standalone workstations
running RHEL 6.x

  http://dirac.dcs.it.mtu.edu/ganglia/

and the steps I followed are documented here:

  
  http://sgowtham.net/blog/2012/02/11/ganglia-gmond-python-module-for-monitoring-nvidia-gpu/




Part #1: Python bindings for the NVML

 http://pypi.python.org/pypi/nvidia-ml-py/

This requires Python newer than 2.4. Following
Phil Papadopoulos' instructions in a recent email on
the Rocks mailing list, I installed Python 2.7 and 3.x,
and used Python 2.7 to install these Python bindings
for NVML.


These are the commands I ran on the front end as well
as on the compute nodes:

  cd /share/apps/tmp/
  wget http://pypi.python.org/packages/source/n/nvidia-ml-py/nvidia-ml-py-2.285.01.tar.gz

  cd /tmp/
  tar -zxvf /share/apps/tmp/nvidia-ml-py-2.285.01.tar.gz
  cd nvidia-ml-py-2.285.01
  /opt/python/bin/python2.7 setup.py install


The process completes with no errors and prints this output:

  running install
  running build
  running build_py
  running install_lib
  running install_egg_info
  Writing /opt/python/lib/python2.7/site-packages/nvidia_ml_py-2.285.01-py2.7.egg-info
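
A clean install does not by itself show that the bindings can reach
the driver. A minimal sketch of a check, run outside of gmond, is
below. It uses the pynvml calls nvmlInit, nvmlDeviceGetCount,
nvmlDeviceGetHandleByIndex, nvmlDeviceGetName and nvmlShutdown; the
function name list_gpus and the file name check_nvml.py are made up
for illustration.

```python
# Sketch: enumerate GPUs through the NVML bindings, outside of gmond.
# list_gpus is a hypothetical helper, not part of nvidia-ml-py.


def list_gpus(nvml):
    """Return the names of all GPUs visible to NVML.

    `nvml` is the imported pynvml module (or any object that offers
    the same five calls).
    """
    nvml.nvmlInit()
    try:
        names = []
        for index in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(index)
            names.append(nvml.nvmlDeviceGetName(handle))
        return names
    finally:
        # Always release NVML, even if a query above raised.
        nvml.nvmlShutdown()
```

Saved as check_nvml.py (hypothetical name), it could be run with the
same interpreter that did the install:

  /opt/python/bin/python2.7 -c "import pynvml, check_nvml; print(check_nvml.list_gpus(pynvml))"

On a node with 2 GPU cards, a list with two names would be expected.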



Part #2: Ganglia/gmond python modules & web patch

I downloaded

  ganglia-gmond_python_modules-3dfa553.tar.gz

from

  https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia

to /share/apps/tmp/. The commands I then ran on the
front end are as follows:

  cd /tmp/
  cp nvidia-ml-py-2.285.01/nvidia_smi.py /opt/ganglia/lib64/ganglia/python_modules/
  cp nvidia-ml-py-2.285.01/pynvml.py /opt/ganglia/lib64/ganglia/python_modules/

  tar -zxvf /share/apps/tmp/ganglia-gmond_python_modules-3dfa553.tar.gz
  cd ganglia-gmond_python_modules-3dfa553
  cp python_modules/nvidia.py /opt/ganglia/lib64/ganglia/python_modules/
  cp conf.d/nvidia.pyconf /etc/ganglia/conf.d/
  cp conf.d/nvidia.pyconf /opt/ganglia/etc/conf.d/
  cp graph.d/*.php /var/www/html/ganglia/graph.d/

  cd /var/www/html/ganglia/
  patch -p0 < /tmp/ganglia-gmond_python_modules-3dfa553/gpu/nvidia/ganglia_web.patch

  /etc/init.d/gmetad restart
  /etc/init.d/gmond restart



Then on the compute node, I did the following:

  cd /tmp/
  cp nvidia-ml-py-2.285.01/nvidia_smi.py /opt/ganglia/lib64/ganglia/python_modu$
  cp nvidia-ml-py-2.285.01/pynvml.py /opt/ganglia/lib64/ganglia/python_modules/

  tar -zxvf /share/apps/tmp/ganglia-gmond_python_modules-3dfa553.tar.gz
  cd ganglia-gmond_python_modules-3dfa553
  cp python_modules/nvidia.py /opt/ganglia/lib64/ganglia/python_modules/
  cp conf.d/nvidia.pyconf /etc/ganglia/conf.d/
  cp conf.d/nvidia.pyconf /opt/ganglia/etc/conf.d/

  /etc/init.d/gmond restart



When I point the browser to the cluster's Ganglia page and
click on 'compute-0-0', GPU metrics do not show up.
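
To tell whether the metrics are missing in gmond itself or only in the
web front end, the XML that gmond serves can be inspected directly.
The sketch below assumes that gmond accepts TCP connections on its
default port 8649 and that the nvidia module's metric names start
with "gpu"; both are assumptions, and the function names are made up
for illustration.

```python
import socket
import xml.etree.ElementTree as ET


def read_gmond_xml(host, port=8649, timeout=5.0):
    """Read the XML dump gmond sends to a client that connects over TCP."""
    sock = socket.create_connection((host, port), timeout)
    try:
        chunks = []
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the connection after the dump
                break
            chunks.append(data)
    finally:
        sock.close()
    return b"".join(chunks)


def metrics_by_host(xml_bytes, prefix="gpu"):
    """Map each HOST name to its sorted metric names starting with `prefix`."""
    root = ET.fromstring(xml_bytes)
    result = {}
    for host in root.iter("HOST"):
        names = [metric.get("NAME") for metric in host.iter("METRIC")]
        result[host.get("NAME")] = sorted(
            name for name in names if name and name.startswith(prefix)
        )
    return result
```

For example, under Python 2.7:

  print(metrics_by_host(read_gmond_xml("compute-0-0")))

An empty list for compute-0-0 would suggest that gmond on that node is
not loading the module at all, rather than a problem with the web patch.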

What am I doing wrong? Did I miss something simple or
important? Does this have anything to do with the
fact that most of the Rocks utilities are built with
Python 2.4, while the NVML bindings were installed
with Python 2.7?
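
One detail that may bear on this: pynvml is written on top of ctypes,
which only joined the Python standard library in version 2.5. If the
gmond shipped with Rocks embeds the system Python 2.4 (an assumption,
not verified here), copying pynvml.py into python_modules would not be
enough for nvidia.py to load. The sketch below avoids syntax newer
than Python 2.4 so that it can be run under each interpreter on the
node; interpreter_report is a made-up name.

```python
import sys


def interpreter_report():
    """Describe the running interpreter and whether it can import ctypes."""
    try:
        __import__("ctypes")
        has_ctypes = True
    except ImportError:
        has_ctypes = False
    return {
        "executable": sys.executable,
        "version": "%d.%d.%d" % tuple(sys.version_info[:3]),
        "has_ctypes": has_ctypes,
    }


def format_report(report):
    """Render the report as a single line of text."""
    return "%s (Python %s): ctypes available = %s" % (
        report["executable"],
        report["version"],
        report["has_ctypes"],
    )
```

Printing format_report(interpreter_report()) once with the system
python and once with /opt/python/bin/python2.7 would show whether the
two interpreters differ in the one dependency pynvml cannot do without.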

If any of you have tried this on your cluster and
got it to work, I'd greatly appreciate some direction.

Thanks for your time and help.

Best,
g

--
Gowtham
Information Technology Services
Michigan Technological University

(906) 487/3593
http://www.it.mtu.edu/
