Hi again,
I don't know how, but I have solved it. I just renamed the gres.conf
file and renamed it back, changed some parameters in slurm.conf, went
back to my desired configuration and rebooted... and now it works!
For anyone interested in how I managed to set different device
numbers in CUDA_VISIBLE_DEVICES: I have configured a task_prolog
file, as follows:
---- tprolog.sh:
EXP_D=`echo $CUDA_VISIBLE_DEVICES | tr "," " "`

for i in $EXP_D
do
    d=`expr $i + 2`
    if [ -z "$first_time" ]
    then
        NEW_DEVICES=$d
    else
        NEW_DEVICES=$NEW_DEVICES,$d
    fi

    first_time=0
done

echo "export CUDA_VISIBLE_DEVICES=$NEW_DEVICES"
-----
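For anyone adapting this, here is a slightly more defensive sketch of the same idea. It is only an illustration under assumptions: the GPU_OFFSET variable and the remap_devices function are names made up for this example, and the offset of 2 matches this particular server only. It skips the export when no devices were assigned, so jobs submitted without --gres are left untouched.

```shell
#!/bin/bash
# Sketch of a task prolog that shifts SLURM's device indices by a fixed offset.
# GPU_OFFSET is a hypothetical name; 2 matches the server described in this thread.
GPU_OFFSET=${GPU_OFFSET:-2}

# remap_devices OFFSET LIST
# Print LIST (comma-separated integers) with OFFSET added to each entry.
remap_devices() {
    local offset=$1 list=$2 out="" i
    local IFS=','
    for i in $list; do
        # Append a comma only when out already holds an entry.
        out="${out:+$out,}$((i + offset))"
    done
    printf '%s\n' "$out"
}

# Only emit the export line when devices were actually assigned.
if [ -n "$CUDA_VISIBLE_DEVICES" ]; then
    echo "export CUDA_VISIBLE_DEVICES=$(remap_devices "$GPU_OFFSET" "$CUDA_VISIBLE_DEVICES")"
fi
```

The script is registered through the TaskProlog parameter in slurm.conf. SLURM reads the prolog's standard output and applies lines of the form "export NAME=value" to the environment of the task, which is why the script prints the export line instead of executing it.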
Now I think it should be enough for my single server.
Thanks and best,
Miguel
On 03/08/13 12:46, Miguel Ángel Martínez del Amor wrote:
Hi all,
I'm pretty new to SLURM. I'm moving from Grid Engine, looking for
better GPU management.
We have one server (Ubuntu Server 12.04 64-bit, SLURM 2.3.2) with 4
GPUs, but they are assigned in a particular way: device 0 is for testing,
device 1 is a Fermi GPU (for testing as well), and devices 2 and 3
(same GPU model as device 0) are going to be managed by SLURM.
I have configured slurm.conf as attached, and gres.conf as
follows:
    Name=gpu File=/dev/nvidia2 CPUs=[0-3]
    Name=gpu File=/dev/nvidia3 CPUs=[4-7]
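As a point of reference, gres.conf alone is not enough: slurm.conf also has to declare the resource type and the per-node count. A minimal sketch of the matching slurm.conf lines follows. The node name gpuserver and the CPU count are assumptions for illustration, not taken from the attached file:

```
# slurm.conf fragment (sketch; node name and CPU count are assumed)
GresTypes=gpu
NodeName=gpuserver CPUs=8 Gres=gpu:2 State=UNKNOWN
```

The count in Gres=gpu:2 should agree with the number of File= entries in gres.conf.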
My problem arises when I launch sbatch or srun: I get the following
error (only when using --gres=gpu; if I remove --gres, it works fine):
    $ sbatch --gres=gpu:1 show_device.sh
    sbatch: error: Batch job submission failed: Requested node configuration is not available

    $ sbatch -n 2 --gres=gpu:2 show_device.sh
    sbatch: error: Batch job submission failed: Requested node configuration is not available

    $ srun -n 2 --gres=gpu:2 show_device.sh
    srun: error: Unable to allocate resources: Requested node configuration is not available
I guess something is wrong with my configuration. I think my problem
is closely related to the one discussed at
https://groups.google.com/forum/#!topic/slurm-devel/duLt-jPBGp4
but that thread still has no solution.
Moreover, do you think SLURM is going to assign only devices 2 and 3
to CUDA_VISIBLE_DEVICES, or is it going to number them from 0 (i.e.
devices 0 and 1)? If the latter, what do you suggest?
Do I have to configure a pre-script that adds 2 to each value in
CUDA_VISIBLE_DEVICES? How can I do that automatically, by default, for
any user?
Thank you very much in advance.
Best,
Miguel
P.S.: show_device.sh is just a script for testing and understanding SLURM:
    #!/bin/bash

    echo Hostname=`hostname`
    echo PWD=`pwd`
    echo USER=`whoami`
    echo PATH=$PATH
    echo LD_LIBRARY_PATH=$LD_LIBRARY_PATH
    echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES
--
Miguel Ángel Martínez del Amor, Ph.D.
Research Group on Natural Computing (RGNC).
Department of Computer Science and Artificial Intelligence.
E.T.S. Ingeniería Informática, 41012 Avda. Reina Mercedes.
University of Seville, Sevilla (Spain).
Webpage: http://www.gcn.us.es/mdelamor
Tel.: (+34)954 557 953