Hello everybody,

I am using SLURM 2.6.2 with MVAPICH2 on a cluster (with the configuration attached).

When I try to submit this command:

> srun --gres=gpu:1 $path_mpirun -np 1 $path_app/lmp_g++ -var x 2 -var y 2 -var z 4 -sf cuda < $path_test/in.lj

I get the following error from srun:

> srun: error: Unable to create job step: Access/permission denied

Moreover, you will find attached the logs of the controller and the compute node. Any idea about what happened?

Regards!

--
*Sergio Iserte Agut, research assistant,*
*High Performance Computing & Architecture*
*Jaume I University (Castellón, Spain)*
slurmctld: debug2: sched: Processing RPC: REQUEST_RESOURCE_ALLOCATION from uid=1053
slurmctld: debug3: JobDesc: user_id=1053 job_id=-1 partition=(null) name=mpirun
slurmctld: debug3: cpus=1-4294967294 pn_min_cpus=-1
slurmctld: debug3: -N min-[max]: 1-[4294967294]:65534:65534:65534
slurmctld: debug3: pn_min_memory_job=-1 pn_min_tmp_disk=-1
slurmctld: debug3: immediate=0 features=(null) reservation=(null)
slurmctld: debug3: req_nodes=(null) exc_nodes=(null) gres=gpu:1
slurmctld: debug3: time_limit=-1--1 priority=-1 contiguous=0 shared=-1
slurmctld: debug3: kill_on_node_fail=-1 script=(null)
slurmctld: debug3: argv="/nfs/LIBS/LIBS/MVAPICH2/2.0b/bin/mpirun"
slurmctld: debug3: stdin=(null) stdout=(null) stderr=(null)
slurmctld: debug3: work_dir=/nfs/gap/siserte/lammps55_executions alloc_node:sid=mlxc2:3952
slurmctld: debug3: resp_host=192.168.0.2 alloc_resp_port=37264 other_port=44587
slurmctld: debug3: dependency=(null) account=(null) qos=(null) comment=(null)
slurmctld: debug3: mail_type=0 mail_user=(null) nice=55534 num_tasks=-1 open_mode=0 overcommit=-1 acctg_freq=(null)
slurmctld: debug3: network=(null) begin=Unknown cpus_per_task=-1 requeue=-1 licenses=(null)
slurmctld: debug3: end_time=Unknown signal=0@0 wait_all_nodes=-1
slurmctld: debug3: ntasks_per_node=-1 ntasks_per_socket=-1 ntasks_per_core=-1
slurmctld: debug3: cpus_bind=65534:(null) mem_bind=65534:(null) plane_size:65534
slurmctld: debug3: array_inx=(null)
slurmctld: debug3: found correct qos
slurmctld: sched: update_job: invalid gres (null) for job 14164
slurmctld: debug2: found 1 usable nodes from config containing mlxc1i1
slurmctld: debug2: found 1 usable nodes from config containing mlxc2i1
slurmctld: debug3: _pick_best_nodes: job 14164 idle_nodes 2 share_nodes 2
slurmctld: debug2: select_p_job_test for job 14164
slurmctld: debug3: cons_res: _vns: node mlxc1i1 lacks gres
slurmctld: debug3: cons_res: _add_job_to_res: job 14164 act 0
slurmctld: debug3: cons_res: adding job 14164 to part main row 0
slurmctld: debug2: sched: JobId=14164 allocated resources: NodeList=mlxc2i1
slurmctld: sched: _slurm_rpc_allocate_resources JobId=14164 NodeList=mlxc2i1 usec=112103
slurmctld: debug2: _slurm_rpc_job_ready(14164)=3 usec=2
slurmctld: debug2: Processing RPC: REQUEST_JOB_STEP_CREATE from uid=1053
slurmctld: debug3: StepDesc: user_id=1053 job_id=14164 node_count=1-1 cpu_count=2
slurmctld: debug3: cpu_freq=4294967294 num_tasks=1 relative=65534 task_dist=1 node_list=(null)
slurmctld: debug3: host=mlxc2 port=42038 name=mpirun network=(null) exclusive=0
slurmctld: debug3: checkpoint-dir=/nfs/gap/siserte/lammps55_executions checkpoint_int=0
slurmctld: debug3: mem_per_node=0 resv_port_cnt=65534 immediate=0 no_kill=0
slurmctld: debug3: overcommit=0 time_limit=0 gres=gpu:1 constraints=(null)
slurmctld: debug: Configuration for job 14164 complete
slurmctld: debug3: step_layout cpus = 2 pos = 0
slurmctld: debug: laying out the 1 tasks on 1 hosts mlxc2i1 dist 1
slurmctld: sched: _slurm_rpc_job_step_create: StepId=14164.0 mlxc2i1 usec=20934
slurmctld: debug2: Processing RPC: REQUEST_JOB_ALLOCATION_INFO_LITE from uid=1901
slurmctld: debug: _slurm_rpc_job_alloc_info_lite JobId=14164 NodeList=mlxc2i1 usec=2831
slurmctld: debug2: Processing RPC: REQUEST_JOB_STEP_CREATE from uid=1901
slurmctld: debug3: StepDesc: user_id=1901 job_id=14164 node_count=1-1 cpu_count=1
slurmctld: debug3: cpu_freq=4294967294 num_tasks=1 relative=65534 task_dist=1 node_list=(null)
slurmctld: debug3: host=mlxc2 port=40226 name=hydra_pmi_proxy network=(null) exclusive=0
slurmctld: debug3: checkpoint-dir=/nfs/gap/siserte/lammps55_executions checkpoint_int=0
slurmctld: debug3: mem_per_node=0 resv_port_cnt=65534 immediate=0 no_kill=0
slurmctld: debug3: overcommit=0 time_limit=0 gres=(null) constraints=(null)
slurmctld: _slurm_rpc_job_step_create for job 14164: Access/permission denied
slurmctld: debug3: Writing job id 14164 to header record of job_state file
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6001
slurmd: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
slurmd: debug: task_slurmd_launch_request: 14166.0 0
slurmd: launch task 14166.0 request from [email protected] (port 6859)
slurmd: debug: Checking credential with 340 bytes of sig data
slurmd: scaling CPU count by factor of 2
slurmd: debug: Calling /nfs/gap/slurm/sbin/slurmstepd spank prolog
spank-prolog: Reading slurm.conf file: /home/siserte/slurm_conf/slurm.conf
spank-prolog: Running spank/prolog for jobid [14166] uid [1053]
spank-prolog: spank: opening plugin stack /home/siserte/slurm_conf/plugstack.conf
slurmd: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
slurmd: debug3: slurmstepd rank 0 (mlxc2i1), parent rank -1 (NONE), children 0, depth 0, max_depth 0
slurmd: debug3: _send_slurmstepd_init: call to getpwuid_r
slurmd: debug3: _send_slurmstepd_init: return from getpwuid_r
slurmd: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
slurmd: debug: task_slurmd_reserve_resources: 14166 0
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6004
slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd: debug: Sending signal 995 to step 14166.0
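
For anyone reading the slurmctld output in this message, a quick way to see which uid issued each RPC is a small script like the one below. It is only a sketch: the regular expression and the helper name `rpc_uids` are assumptions made for illustration, not part of SLURM, and the sample lines are copied from the controller log in this message.

```python
import re

# Assumed pattern, based only on the "Processing RPC: <NAME> from uid=<N>"
# lines that appear in the slurmctld output in this message.
RPC_RE = re.compile(r"Processing RPC: (\w+) from uid=(\d+)")


def rpc_uids(log_text):
    """Return (rpc_name, uid) tuples in order of appearance (hypothetical helper)."""
    return [(name, int(uid)) for name, uid in RPC_RE.findall(log_text)]


# Sample lines copied from the controller log in this message.
sample = """\
slurmctld: debug2: sched: Processing RPC: REQUEST_RESOURCE_ALLOCATION from uid=1053
slurmctld: debug2: Processing RPC: REQUEST_JOB_STEP_CREATE from uid=1053
slurmctld: debug2: Processing RPC: REQUEST_JOB_ALLOCATION_INFO_LITE from uid=1901
slurmctld: debug2: Processing RPC: REQUEST_JOB_STEP_CREATE from uid=1901
"""

for name, uid in rpc_uids(sample):
    print(uid, name)
```

On this log, the allocation and the first step creation come from uid 1053, while the step request that is denied (the one named hydra_pmi_proxy) comes from uid 1901. Whether that difference is the cause of the error is not something the log confirms by itself.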
