Hello everybody,

I am using SLURM 2.6.2 with MVAPICH2 on a cluster (configuration
attached).

When I run this command:

> srun --gres=gpu:1 $path_mpirun -np 1 $path_app/lmp_g++ -var x 2 -var y 2
> -var z 4 -sf cuda < $path_test/in.lj

I get the following error from srun:

> srun: error: Unable to create job step: Access/permission denied

I have also attached the logs from the controller and the compute
node.

Any idea what happened?

Regards!

-- 
*Sergio Iserte Agut, research assistant,*
*High Performance Computing & Architecture*
*Jaume I University (Castellón, Spain)*

Attachment: slurm.conf
Description: Binary data

slurmctld: debug2: sched: Processing RPC: REQUEST_RESOURCE_ALLOCATION from uid=1053
slurmctld: debug3: JobDesc: user_id=1053 job_id=-1 partition=(null) name=mpirun
slurmctld: debug3:    cpus=1-4294967294 pn_min_cpus=-1
slurmctld: debug3:    -N min-[max]: 1-[4294967294]:65534:65534:65534
slurmctld: debug3:    pn_min_memory_job=-1 pn_min_tmp_disk=-1
slurmctld: debug3:    immediate=0 features=(null) reservation=(null)
slurmctld: debug3:    req_nodes=(null) exc_nodes=(null) gres=gpu:1
slurmctld: debug3:    time_limit=-1--1 priority=-1 contiguous=0 shared=-1
slurmctld: debug3:    kill_on_node_fail=-1 script=(null)
slurmctld: debug3:    argv="/nfs/LIBS/LIBS/MVAPICH2/2.0b/bin/mpirun"
slurmctld: debug3:    stdin=(null) stdout=(null) stderr=(null)
slurmctld: debug3:    work_dir=/nfs/gap/siserte/lammps55_executions alloc_node:sid=mlxc2:3952
slurmctld: debug3:    resp_host=192.168.0.2 alloc_resp_port=37264  other_port=44587
slurmctld: debug3:    dependency=(null) account=(null) qos=(null) comment=(null)
slurmctld: debug3:    mail_type=0 mail_user=(null) nice=55534 num_tasks=-1 open_mode=0 overcommit=-1 acctg_freq=(null)
slurmctld: debug3:    network=(null) begin=Unknown cpus_per_task=-1 requeue=-1 licenses=(null)
slurmctld: debug3:    end_time=Unknown signal=0@0 wait_all_nodes=-1
slurmctld: debug3:    ntasks_per_node=-1 ntasks_per_socket=-1 ntasks_per_core=-1
slurmctld: debug3:    cpus_bind=65534:(null) mem_bind=65534:(null) plane_size:65534
slurmctld: debug3:    array_inx=(null)
slurmctld: debug3: found correct qos
slurmctld: sched: update_job: invalid gres (null) for job 14164
slurmctld: debug2: found 1 usable nodes from config containing mlxc1i1
slurmctld: debug2: found 1 usable nodes from config containing mlxc2i1
slurmctld: debug3: _pick_best_nodes: job 14164 idle_nodes 2 share_nodes 2
slurmctld: debug2: select_p_job_test for job 14164
slurmctld: debug3: cons_res: _vns: node mlxc1i1 lacks gres
slurmctld: debug3: cons_res: _add_job_to_res: job 14164 act 0 
slurmctld: debug3: cons_res: adding job 14164 to part main row 0
slurmctld: debug2: sched: JobId=14164 allocated resources: NodeList=mlxc2i1
slurmctld: sched: _slurm_rpc_allocate_resources JobId=14164 NodeList=mlxc2i1 usec=112103
slurmctld: debug2: _slurm_rpc_job_ready(14164)=3 usec=2
slurmctld: debug2: Processing RPC: REQUEST_JOB_STEP_CREATE from uid=1053
slurmctld: debug3: StepDesc: user_id=1053 job_id=14164 node_count=1-1 cpu_count=2
slurmctld: debug3:    cpu_freq=4294967294 num_tasks=1 relative=65534 task_dist=1 node_list=(null)
slurmctld: debug3:    host=mlxc2 port=42038 name=mpirun network=(null) exclusive=0
slurmctld: debug3:    checkpoint-dir=/nfs/gap/siserte/lammps55_executions checkpoint_int=0
slurmctld: debug3:    mem_per_node=0 resv_port_cnt=65534 immediate=0 no_kill=0
slurmctld: debug3:    overcommit=0 time_limit=0 gres=gpu:1 constraints=(null)
slurmctld: debug:  Configuration for job 14164 complete
slurmctld: debug3: step_layout cpus = 2 pos = 0
slurmctld: debug:  laying out the 1 tasks on 1 hosts mlxc2i1 dist 1
slurmctld: sched: _slurm_rpc_job_step_create: StepId=14164.0 mlxc2i1 usec=20934
slurmctld: debug2: Processing RPC: REQUEST_JOB_ALLOCATION_INFO_LITE from uid=1901
slurmctld: debug:  _slurm_rpc_job_alloc_info_lite JobId=14164 NodeList=mlxc2i1 usec=2831
slurmctld: debug2: Processing RPC: REQUEST_JOB_STEP_CREATE from uid=1901
slurmctld: debug3: StepDesc: user_id=1901 job_id=14164 node_count=1-1 cpu_count=1
slurmctld: debug3:    cpu_freq=4294967294 num_tasks=1 relative=65534 task_dist=1 node_list=(null)
slurmctld: debug3:    host=mlxc2 port=40226 name=hydra_pmi_proxy network=(null) exclusive=0
slurmctld: debug3:    checkpoint-dir=/nfs/gap/siserte/lammps55_executions checkpoint_int=0
slurmctld: debug3:    mem_per_node=0 resv_port_cnt=65534 immediate=0 no_kill=0
slurmctld: debug3:    overcommit=0 time_limit=0 gres=(null) constraints=(null)
slurmctld: _slurm_rpc_job_step_create for job 14164: Access/permission denied
slurmctld: debug3: Writing job id 14164 to header record of job_state file

slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6001
slurmd: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
slurmd: debug:  task_slurmd_launch_request: 14166.0 0
slurmd: launch task 14166.0 request from [email protected] (port 6859)
slurmd: debug:  Checking credential with 340 bytes of sig data
slurmd: scaling CPU count by factor of 2
slurmd: debug:  Calling /nfs/gap/slurm/sbin/slurmstepd spank prolog
spank-prolog: Reading slurm.conf file: /home/siserte/slurm_conf/slurm.conf
spank-prolog: Running spank/prolog for jobid [14166] uid [1053]
spank-prolog: spank: opening plugin stack /home/siserte/slurm_conf/plugstack.conf
slurmd: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
slurmd: debug3: slurmstepd rank 0 (mlxc2i1), parent rank -1 (NONE), children 0, depth 0, max_depth 0
slurmd: debug3: _send_slurmstepd_init: call to getpwuid_r
slurmd: debug3: _send_slurmstepd_init: return from getpwuid_r
slurmd: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
slurmd: debug:  task_slurmd_reserve_resources: 14166 0
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6004
slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd: debug:  Sending signal 995 to step 14166.0
