On 2016-11-17 12:53, Manuel Rodríguez Pascual wrote:
> Hi all,
>
> I keep having issues using Slurm + mvapich2: it seems that I cannot
> configure them to work together correctly. In particular, sbatch works
> correctly but srun does not. Maybe someone here can offer some guidance,
> as I suspect the error is an obvious one that I just cannot find.
>
> CONFIGURATION INFO:
> I am employing Slurm 17.02.0-0pre2 and mvapich2 2.2.
> mvapich2 is compiled with "--disable-mcast --with-slurm=<my slurm
> location>" (there is a note about this at the bottom of the mail).
> Slurm is compiled with no special options. After compilation, I executed
> "make && make install" in "contribs/pmi2/" (a step I read about somewhere).
> Slurm is configured with "MpiDefault=pmi2" in slurm.conf
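>
> (Sanity check, assuming a standard build: "srun --mpi=list" should
> list pmi2 among the available plugins, and slurm.conf should carry the
> matching default:
>
> $ srun --mpi=list        # pmi2 should appear in the output
> $ grep MpiDefault /home/localsoft/slurm/etc/slurm.conf
> MpiDefault=pmi2
>
> The exact plugin list may differ per build.)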
>
> TESTS:
> I am executing a "helloWorldMPI" program that prints a hello world
> message and the node name for each MPI task.
>
> sbatch works perfectly:
>
> $ sbatch -n 2 --tasks-per-node=2 --wrap 'mpiexec  ./helloWorldMPI'
> Submitted batch job 750
>
> $ more slurm-750.out
> Process 0 of 2 is on acme12.ciemat.es
> Hello world from process 0 of 2
> Process 1 of 2 is on acme12.ciemat.es
> Hello world from process 1 of 2
>
> $ sbatch -n 2 --tasks-per-node=1 -p debug --wrap 'mpiexec  ./helloWorldMPI'
> Submitted batch job 748
>
> $ more slurm-748.out
> Process 0 of 2 is on acme11.ciemat.es
> Hello world from process 0 of 2
> Process 1 of 2 is on acme12.ciemat.es
> Hello world from process 1 of 2
>
>
> However, srun fails.
> On a single node it works correctly:
> $ srun -n 2 --tasks-per-node=2   ./helloWorldMPI
> Process 0 of 2 is on acme11.ciemat.es
> Hello world from process 0 of 2
> Process 1 of 2 is on acme11.ciemat.es
> Hello world from process 1 of 2
>
> But when using more than one node, it fails. Below is the experiment
> with a lot of debugging info, in case it helps.
>
> (note that the job ID sometimes differs, as this mail is the result of
> multiple submissions and copy/pastes)
>
> $ srun -n 2 --tasks-per-node=1   ./helloWorldMPI
> srun: error: mpi/pmi2: failed to send temp kvs to compute nodes
> slurmstepd: error: *** STEP 753.0 ON acme11 CANCELLED AT
> 2016-11-17T10:19:47 ***
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: acme11: task 0: Killed
> srun: error: acme12: task 1: Killed
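>
> (One way to confirm which plugin the step actually uses is to request
> it explicitly and raise verbosity; this is purely diagnostic and the
> result should match MpiDefault:
>
> $ srun --mpi=pmi2 -v -n 2 --tasks-per-node=1 ./helloWorldMPI
> )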
>
>
> Slurmctld output:
> slurmctld: debug2: Performing purge of old job records
> slurmctld: debug2: Performing full system state save
> slurmctld: debug3: Writing job id 753 to header record of job_state file
> slurmctld: debug2: sched: Processing RPC: REQUEST_RESOURCE_ALLOCATION
> from uid=500
> slurmctld: debug3: JobDesc: user_id=500 job_id=N/A partition=(null)
> name=helloWorldMPI
> slurmctld: debug3:    cpus=2-4294967294 pn_min_cpus=-1 core_spec=-1
> slurmctld: debug3:    Nodes=1-[4294967294] Sock/Node=65534
> Core/Sock=65534 Thread/Core=65534
> slurmctld: debug3:    pn_min_memory_job=18446744073709551615
> pn_min_tmp_disk=-1
> slurmctld: debug3:    immediate=0 features=(null) reservation=(null)
> slurmctld: debug3:    req_nodes=(null) exc_nodes=(null) gres=(null)
> slurmctld: debug3:    time_limit=-1--1 priority=-1 contiguous=0 shared=-1
> slurmctld: debug3:    kill_on_node_fail=-1 script=(null)
> slurmctld: debug3:    argv="./helloWorldMPI"
> slurmctld: debug3:    stdin=(null) stdout=(null) stderr=(null)
> slurmctld: debug3:    work_dir=/home/slurm/tests alloc_node:sid=acme31:11229
> slurmctld: debug3:    power_flags=
> slurmctld: debug3:    resp_host=172.17.31.165 alloc_resp_port=56804
> other_port=33290
> slurmctld: debug3:    dependency=(null) account=(null) qos=(null)
> comment=(null)
> slurmctld: debug3:    mail_type=0 mail_user=(null) nice=0 num_tasks=2
> open_mode=0 overcommit=-1 acctg_freq=(null)
> slurmctld: debug3:    network=(null) begin=Unknown cpus_per_task=-1
> requeue=-1 licenses=(null)
> slurmctld: debug3:    end_time= signal=0@0 wait_all_nodes=-1 cpu_freq=
> slurmctld: debug3:    ntasks_per_node=1 ntasks_per_socket=-1
> ntasks_per_core=-1
> slurmctld: debug3:    mem_bind=65534:(null) plane_size:65534
> slurmctld: debug3:    array_inx=(null)
> slurmctld: debug3:    burst_buffer=(null)
> slurmctld: debug3:    mcs_label=(null)
> slurmctld: debug3:    deadline=Unknown
> slurmctld: debug3:    bitflags=0 delay_boot=4294967294
> slurmctld: debug3: User (null)(500) doesn't have a default account
> slurmctld: debug3: User (null)(500) doesn't have a default account
> slurmctld: debug3: found correct qos
> slurmctld: debug3: before alteration asking for nodes 1-4294967294 cpus
> 2-4294967294
> slurmctld: debug3: after alteration asking for nodes 1-4294967294 cpus
> 2-4294967294
> slurmctld: debug2: found 8 usable nodes from config containing
> acme[11-14,21-24]
> slurmctld: debug3: _pick_best_nodes: job 754 idle_nodes 8 share_nodes 8
> slurmctld: debug5: powercapping: checking job 754 : skipped, capping
> disabled
> slurmctld: debug2: sched: JobId=754 allocated resources:
> NodeList=acme[11-12]
> slurmctld: sched: _slurm_rpc_allocate_resources JobId=754
> NodeList=acme[11-12] usec=1340
> slurmctld: debug3: Writing job id 754 to header record of job_state file
> slurmctld: debug2: _slurm_rpc_job_ready(754)=3 usec=4
> slurmctld: debug3: StepDesc: user_id=500 job_id=754 node_count=2-2
> cpu_count=2 num_tasks=2
> slurmctld: debug3:    cpu_freq_gov=4294967294 cpu_freq_max=4294967294
> cpu_freq_min=4294967294 relative=65534 task_dist=0x1 plane=1
> slurmctld: debug3:    node_list=(null)  constraints=(null)
> slurmctld: debug3:    host=acme31 port=36711 srun_pid=8887
> name=helloWorldMPI network=(null) exclusive=0
> slurmctld: debug3:    checkpoint-dir=/home/localsoft/slurm/checkpoint
> checkpoint_int=0
> slurmctld: debug3:    mem_per_node=0 resv_port_cnt=65534 immediate=0
> no_kill=0
> slurmctld: debug3:    overcommit=0 time_limit=0 gres=(null)
> slurmctld: _pick_step_nodes: Configuration for job 754 is complete
> slurmctld: debug3: step_layout cpus = 16 pos = 0
> slurmctld: debug3: step_layout cpus = 16 pos = 1
> slurmctld: debug:  laying out the 2 tasks on 2 hosts acme[11-12] dist 1
> slurmctld: debug2: Testing job time limits and checkpoints
> slurmctld: debug2: Performing purge of old job records
> slurmctld: debug:  sched: Running job scheduler
> slurmctld: debug3: Writing job id 754 to header record of job_state file
> slurmctld: debug2: Performing purge of old job records
> slurmctld: debug2: Spawning RPC agent for msg_type SRUN_JOB_COMPLETE
> slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_SIGNAL_TASKS
> slurmctld: debug2: got 1 threads to send out
> slurmctld: debug2: got 1 threads to send out
> slurmctld: debug2: Tree head got back 0 looking for 2
> slurmctld: debug3: Tree sending to acme11
> slurmctld: debug3: Tree sending to acme12
> slurmctld: debug3: slurm_send_only_node_msg: sent 181
> slurmctld: debug4: orig_timeout was 10000 we have 0 steps and a timeout
> of 10000
> slurmctld: debug4: orig_timeout was 10000 we have 0 steps and a timeout
> of 10000
> slurmctld: debug3: Writing job id 754 to header record of job_state file
> slurmctld: debug2: Spawning RPC agent for msg_type SRUN_JOB_COMPLETE
> slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_SIGNAL_TASKS
> slurmctld: debug2: got 1 threads to send out
> slurmctld: debug2: got 1 threads to send out
> slurmctld: debug2: Tree head got back 0 looking for 2
> slurmctld: debug3: Tree sending to acme12
> slurmctld: debug3: Tree sending to acme11
> slurmctld: debug3: slurm_send_only_node_msg: sent 181
> slurmctld: debug4: orig_timeout was 10000 we have 0 steps and a timeout
> of 10000
> slurmctld: debug4: orig_timeout was 10000 we have 0 steps and a timeout
> of 10000
> slurmctld: debug2: Tree head got back 1
> slurmctld: debug2: Tree head got back 2
> slurmctld: debug2: Tree head got back 1
> slurmctld: debug2: Tree head got back 2
> slurmctld: debug2: RPC to node acme12 failed, job not running
> slurmctld: debug2: RPC to node acme11 failed, job not running
> slurmctld: debug2: Processing RPC: REQUEST_COMPLETE_JOB_ALLOCATION from
> uid=500, JobId=754 rc=9
> slurmctld: job_complete: JobID=754 State=0x1 NodeCnt=2 WTERMSIG 9
> slurmctld: debug2: Spawning RPC agent for msg_type SRUN_JOB_COMPLETE
> slurmctld: debug2: Spawning RPC agent for msg_type SRUN_JOB_COMPLETE
> slurmctld: debug2: got 1 threads to send out
> slurmctld: debug3: User (null)(500) doesn't have a default account
> slurmctld: debug2: got 1 threads to send out
> slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
> slurmctld: job_complete: JobID=754 State=0x8003 NodeCnt=2 done
> slurmctld: debug2: _slurm_rpc_complete_job_allocation: JobID=754
> State=0x8003 NodeCnt=2
> slurmctld: debug2: got 1 threads to send out
> slurmctld: debug3: Tree sending to acme11
> slurmctld: debug3: slurm_send_only_node_msg: sent 181
> slurmctld: debug2: Tree head got back 0 looking for 2
> slurmctld: debug3: Tree sending to acme12
> slurmctld: debug3: slurm_send_only_node_msg: sent 181
> slurmctld: debug4: orig_timeout was 10000 we have 0 steps and a timeout
> of 10000
> slurmctld: debug4: orig_timeout was 10000 we have 0 steps and a timeout
> of 10000
> slurmctld: debug2: node_did_resp acme12
> slurmctld: debug2: node_did_resp acme11
> slurmctld: debug2: node_did_resp acme12
> slurmctld: debug2: node_did_resp acme11
> slurmctld: debug2: Tree head got back 1
> slurmctld: debug2: Tree head got back 2
> slurmctld: debug2: node_did_resp acme11
> slurmctld: debug2: node_did_resp acme12
> slurmctld: debug2: Processing RPC: MESSAGE_EPILOG_COMPLETE uid=0
> slurmctld: debug2: _slurm_rpc_epilog_complete: JobID=754 State=0x8003
> NodeCnt=1 Node=acme12
> slurmctld: debug2: Processing RPC: MESSAGE_EPILOG_COMPLETE uid=0
> slurmctld: debug2: _slurm_rpc_epilog_complete: JobID=754 State=0x3
> NodeCnt=0 Node=acme11
> slurmctld: debug:  sched: Running job scheduler
> slurmctld: debug3: Writing job id 754 to header record of job_state file
> slurmctld: debug2: Performing purge of old job records
>
>
>
> slurmd (one node)
> slurmd: debug3: in the service_connection
> slurmd: debug2: got this type of message 6001
> slurmd: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
> slurmd: launch task 754.0 request from 500.1001@172.17.31.165 (port 38631)
> slurmd: debug3: state for jobid 744: ctime:1479371687 revoked:0 expires:0
> slurmd: debug3: state for jobid 745: ctime:1479371707 revoked:0 expires:0
> slurmd: debug3: state for jobid 751: ctime:1479374214 revoked:0 expires:0
> slurmd: debug3: state for jobid 752: ctime:1479374335 revoked:1479374340
> expires:1479374340
> slurmd: debug3: state for jobid 752: ctime:1479374335 revoked:0 expires:0
> slurmd: debug3: state for jobid 753: ctime:1479374372 revoked:1479374387
> expires:1479374387
> slurmd: debug3: state for jobid 753: ctime:1479374372 revoked:0 expires:0
> slurmd: debug:  Checking credential with 300 bytes of sig data
> slurmd: debug:  task_p_slurmd_launch_request: 754.0 0
> slurmd: debug:  Calling /home/localsoft/slurm/sbin/slurmstepd spank prolog
> spank-prolog: debug:  Reading slurm.conf file:
> /home/localsoft/slurm/etc/slurm.conf
> spank-prolog: debug:  Running spank/prolog for jobid [754] uid [500]
> spank-prolog: debug:  spank: opening plugin stack
> /home/localsoft/slurm/etc/plugstack.conf
> slurmd: _run_prolog: run job script took usec=11010
> slurmd: _run_prolog: prolog with lock for job 754 ran for 0 seconds
> slurmd: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
> slurmd: debug3: slurmstepd rank 0 (acme11), parent rank -1 (NONE),
> children 1, depth 0, max_depth 1
> slurmd: debug3: _send_slurmstepd_init: call to getpwuid_r
> slurmd: debug3: _send_slurmstepd_init: return from getpwuid_r
> slurmd: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
> slurmd: debug:  task_p_slurmd_reserve_resources: 754 0
> slurmd: debug3: in the service_connection
> slurmd: debug2: got this type of message 6004
> slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
> slurmd: debug:  _rpc_signal_tasks: sending signal 995 to step 754.0 flag 0
> slurmd: debug3: in the service_connection
> slurmd: debug2: got this type of message 5029
> slurmd: debug3: Entering _rpc_forward_data, address:
> /home/localsoft/slurm/spool//sock.pmi2.754.0, len: 66
> slurmd: debug3: in the service_connection
> slurmd: debug2: got this type of message 6004
> slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
> slurmd: debug:  _rpc_signal_tasks: sending signal 9 to step 754.0 flag 0
> slurmd: debug3: in the service_connection
> slurmd: debug2: got this type of message 6004
> slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
> slurmd: debug:  _rpc_signal_tasks: sending signal 9 to step 754.0 flag 0
> slurmd: debug3: in the service_connection
> slurmd: debug2: got this type of message 5016
> slurmd: debug3: Entering _rpc_step_complete
> slurmd: debug:  Entering stepd_completion for 754.0, range_first = 1,
> range_last = 1
> slurmd: debug3: in the service_connection
> slurmd: debug2: got this type of message 6011
> slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB
> slurmd: debug:  _rpc_terminate_job, uid = 500
> slurmd: debug:  task_p_slurmd_release_resources: 754
> slurmd: debug3: state for jobid 744: ctime:1479371687 revoked:0 expires:0
> slurmd: debug3: state for jobid 745: ctime:1479371707 revoked:0 expires:0
> slurmd: debug3: state for jobid 751: ctime:1479374214 revoked:0 expires:0
> slurmd: debug3: state for jobid 752: ctime:1479374335 revoked:1479374340
> expires:1479374340
> slurmd: debug3: state for jobid 752: ctime:1479374335 revoked:0 expires:0
> slurmd: debug3: state for jobid 753: ctime:1479374372 revoked:1479374387
> expires:1479374387
> slurmd: debug3: state for jobid 753: ctime:1479374372 revoked:0 expires:0
> slurmd: debug3: state for jobid 754: ctime:1479374425 revoked:0 expires:0
> slurmd: debug3: state for jobid 754: ctime:1479374425 revoked:0 expires:0
> slurmd: debug:  credential for job 754 revoked
> slurmd: debug2: No steps in jobid 754 to send signal 18
> slurmd: debug2: No steps in jobid 754 to send signal 15
> slurmd: debug4: sent SUCCESS
> slurmd: debug2: set revoke expiration for jobid 754 to 1479374560 UTS
> slurmd: debug4: unable to create link for
> /home/localsoft/slurm/spool//cred_state ->
> /home/localsoft/slurm/spool//cred_state.old: No such file or directory
> slurmd: debug4: unable to create link for
> /home/localsoft/slurm/spool//cred_state.new ->
> /home/localsoft/slurm/spool//cred_state: No such file or directory
> slurmd: debug:  Waiting for job 754's prolog to complete
> slurmd: debug:  Finished wait for job 754's prolog to complete
> slurmd: debug:  Calling /home/localsoft/slurm/sbin/slurmstepd spank epilog
> spank-epilog: debug:  Reading slurm.conf file:
> /home/localsoft/slurm/etc/slurm.conf
> spank-epilog: debug:  Running spank/epilog for jobid [754] uid [500]
> spank-epilog: debug:  spank: opening plugin stack
> /home/localsoft/slurm/etc/plugstack.conf
> slurmd: debug:  completed epilog for jobid 754
> slurmd: debug3: slurm_send_only_controller_msg: sent 192
> slurmd: debug:  Job 754: sent epilog complete msg: rc = 0
>
>
>
> slurmd (other node)
> slurmd: debug3: in the service_connection
> slurmd: debug2: got this type of message 6001
> slurmd: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
> slurmd: launch task 754.0 request from 500.1001@172.17.31.165 (port 47784)
> slurmd: debug3: state for jobid 744: ctime:1479371687 revoked:0 expires:0
> slurmd: debug3: state for jobid 745: ctime:1479371707 revoked:0 expires:0
> slurmd: debug3: state for jobid 746: ctime:1479371733 revoked:0 expires:0
> slurmd: debug3: state for jobid 747: ctime:1479371785 revoked:0 expires:0
> slurmd: debug3: state for jobid 748: ctime:1479374028 revoked:0 expires:0
> slurmd: debug3: state for jobid 753: ctime:1479374372 revoked:1479374387
> expires:1479374387
> slurmd: debug3: state for jobid 753: ctime:1479374372 revoked:0 expires:0
> slurmd: debug:  Checking credential with 300 bytes of sig data
> slurmd: debug:  task_p_slurmd_launch_request: 754.0 1
> slurmd: debug:  Calling /home/localsoft/slurm/sbin/slurmstepd spank prolog
> spank-prolog: debug:  Reading slurm.conf file:
> /home/localsoft/slurm/etc/slurm.conf
> spank-prolog: debug:  Running spank/prolog for jobid [754] uid [500]
> spank-prolog: debug:  spank: opening plugin stack
> /home/localsoft/slurm/etc/plugstack.conf
> slurmd: _run_prolog: run job script took usec=10434
> slurmd: _run_prolog: prolog with lock for job 754 ran for 0 seconds
> slurmd: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
> slurmd: debug3: slurmstepd rank 1 (acme12), parent rank 0 (acme11),
> children 0, depth 1, max_depth 1
> slurmd: debug3: _send_slurmstepd_init: call to getpwuid_r
> slurmd: debug3: _send_slurmstepd_init: return from getpwuid_r
> slurmd: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
> slurmd: debug4: unable to create link for
> /home/localsoft/slurm/spool//cred_state ->
> /home/localsoft/slurm/spool//cred_state.old: No such file or directory
> slurmd: debug:  task_p_slurmd_reserve_resources: 754 1
> slurmd: debug3: in the service_connection
> slurmd: debug2: got this type of message 6004
> slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
> slurmd: debug:  _rpc_signal_tasks: sending signal 995 to step 754.0 flag 0
> slurmd: debug3: in the service_connection
> slurmd: debug2: got this type of message 5029
> slurmd: debug3: Entering _rpc_forward_data, address:
> /home/localsoft/slurm/spool//sock.pmi2.754.0, len: 66
> slurmd: debug2: failed connecting to specified socket
> '/home/localsoft/slurm/spool//sock.pmi2.754.0': Stale file handle
> slurmd: debug3: in the service_connection
> slurmd: debug2: got this type of message 5029
> slurmd: debug3: Entering _rpc_forward_data, address:
> /home/localsoft/slurm/spool//sock.pmi2.754.0, len: 66
> slurmd: debug2: failed connecting to specified socket
> '/home/localsoft/slurm/spool//sock.pmi2.754.0': Stale file handle
> slurmd: debug3: in the service_connection
> slurmd: debug2: got this type of message 5029
> slurmd: debug3: Entering _rpc_forward_data, address:
> /home/localsoft/slurm/spool//sock.pmi2.754.0, len: 66
> slurmd: debug2: failed connecting to specified socket
> '/home/localsoft/slurm/spool//sock.pmi2.754.0': Stale file handle
> slurmd: debug3: in the service_connection
> slurmd: debug2: got this type of message 5029
> slurmd: debug3: Entering _rpc_forward_data, address:
> /home/localsoft/slurm/spool//sock.pmi2.754.0, len: 66
> slurmd: debug2: failed connecting to specified socket
> '/home/localsoft/slurm/spool//sock.pmi2.754.0': Stale file handle
> slurmd: debug3: in the service_connection
> slurmd: debug2: got this type of message 5029
> slurmd: debug3: Entering _rpc_forward_data, address:
> /home/localsoft/slurm/spool//sock.pmi2.754.0, len: 66
> slurmd: debug2: failed connecting to specified socket
> '/home/localsoft/slurm/spool//sock.pmi2.754.0': Stale file handle
> slurmd: debug3: in the service_connection
> slurmd: debug2: got this type of message 6004
> slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
> slurmd: debug:  _rpc_signal_tasks: sending signal 9 to step 754.0 flag 0
> slurmd: debug3: in the service_connection
> slurmd: debug2: got this type of message 6004
> slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
> slurmd: debug:  _rpc_signal_tasks: sending signal 9 to step 754.0 flag 0
> slurmd: debug3: in the service_connection
> slurmd: debug2: got this type of message 6011
> slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB
> slurmd: debug:  _rpc_terminate_job, uid = 500
> slurmd: debug:  task_p_slurmd_release_resources: 754
> slurmd: debug3: state for jobid 744: ctime:1479371687 revoked:0 expires:0
> slurmd: debug3: state for jobid 745: ctime:1479371707 revoked:0 expires:0
> slurmd: debug3: state for jobid 746: ctime:1479371733 revoked:0 expires:0
> slurmd: debug3: state for jobid 747: ctime:1479371785 revoked:0 expires:0
> slurmd: debug3: state for jobid 748: ctime:1479374028 revoked:0 expires:0
> slurmd: debug3: state for jobid 753: ctime:1479374372 revoked:1479374387
> expires:1479374387
> slurmd: debug3: state for jobid 753: ctime:1479374372 revoked:0 expires:0
> slurmd: debug3: state for jobid 754: ctime:1479374425 revoked:0 expires:0
> slurmd: debug3: state for jobid 754: ctime:1479374425 revoked:0 expires:0
> slurmd: debug4: unable to create link for
> /home/localsoft/slurm/spool//cred_state ->
> /home/localsoft/slurm/spool//cred_state.old: File exists
> slurmd: debug4: unable to create link for
> /home/localsoft/slurm/spool//cred_state.new ->
> /home/localsoft/slurm/spool//cred_state: File exists
> slurmd: debug:  credential for job 754 revoked
> slurmd: debug2: No steps in jobid 754 to send signal 18
> slurmd: debug2: No steps in jobid 754 to send signal 15
> slurmd: debug4: sent SUCCESS
> slurmd: debug2: set revoke expiration for jobid 754 to 1479374560 UTS
> slurmd: debug:  Waiting for job 754's prolog to complete
> slurmd: debug:  Finished wait for job 754's prolog to complete
> slurmd: debug:  Calling /home/localsoft/slurm/sbin/slurmstepd spank epilog
> spank-epilog: debug:  Reading slurm.conf file:
> /home/localsoft/slurm/etc/slurm.conf
> spank-epilog: debug:  Running spank/epilog for jobid [754] uid [500]
> spank-epilog: debug:  spank: opening plugin stack
> /home/localsoft/slurm/etc/plugstack.conf
> slurmd: debug:  completed epilog for jobid 754
> slurmd: debug3: slurm_send_only_controller_msg: sent 192
> slurmd: debug:  Job 754: sent epilog complete msg: rc = 0
>
>
> As you can see, the problem seems to be these lines:
> slurmd: debug2: got this type of message 5029
> slurmd: debug3: Entering _rpc_forward_data, address:
> /home/localsoft/slurm/spool//sock.pmi2.754.0, len: 66
> slurmd: debug2: failed connecting to specified socket
> '/home/localsoft/slurm/spool//sock.pmi2.754.0': Stale file handle
> slurmd: debug3: in the service_connection
>
>
> I have checked that these files exist in the shared storage and are
> accessible by the complaining node. They are, however, empty. Is this
> normal? What should I expect?
>
> $ ssh acme11 'ls -plah /home/localsoft/slurm/spool/'
> total 160K
> drwxr-xr-x  2 slurm slurm 4,0K nov 17 10:26 ./
> drwxr-xr-x 12 slurm slurm 4,0K nov 16 16:20 ../
> srwxrwxrwx  1 root  root     0 nov 17 10:26 acme11_755.0
> srwxrwxrwx  1 root  root     0 nov 17 10:26 acme12_755.0
> -rw-------  1 root  root   284 nov 17 10:26 cred_state.old
> -rw-------  1 slurm slurm 141K nov 16 14:24 slurmdbd.log
> -rw-r--r--  1 slurm slurm    5 nov 16 14:24 slurmdbd.pid
> srwxr-xr-x  1 root  root     0 nov 17 10:26 sock.pmi2.755.0
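>
> (A quick way to check where slurmd expects to create these sockets,
> assuming the standard tools are on the PATH:
>
> $ scontrol show config | grep SlurmdSpoolDir
> )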
>
>
> So any ideas?
>
> thanks for your help,
>
> Manuel
>
>
> PS: About the mvapich2 compilation.
>
> I ran quite a few tests, and I ended up compiling with:
> ./configure --prefix=/home/localsoft/mvapich2 --disable-mcast
> --with-slurm=/home/localsoft/slurm
>
> Before that I tried the instructions
> in http://slurm.schedmd.com/mpi_guide.html#mvapich2, but it fails:
> ./configure --prefix=/home/localsoft/mvapich2 --disable-mcast
>  --with-pmi=pmi2  --with-pm=slurm
> (...)
> checking for slurm/pmi2.h... no
> configure: error: could not find slurm/pmi2.h.  Configure aborted
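>
> (A hedged check, assuming the install prefix above: verify the header
> actually exists where configure will look for it:
>
> $ ls /home/localsoft/slurm/include/slurm/pmi2.h
>
> If it is missing, copying contribs/pmi2/pmi2.h from the slurm source
> tree into that directory is a plausible workaround.)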
>
> I also tried
> ./configure --prefix=/home/localsoft/mvapich2 --disable-mcast
> --with-slurm=/home/localsoft/slurm --with-pmi=pmi2  --with-pm=slurm
> (...)
> checking whether we are cross compiling... configure: error: in
> `/root/mvapich2-2.2/src/mpi/romio':
> configure: error: cannot run C compiled programs.
> If you meant to cross compile, use `--host'.
> See `config.log' for more details
> configure: error: src/mpi/romio configure failed
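>
> (That romio failure often means configure's test binaries link but
> cannot run, e.g. because libpmi2.so is not on the runtime search path.
> A guess worth trying before re-running configure:
>
> $ export LD_LIBRARY_PATH=/home/localsoft/slurm/lib:$LD_LIBRARY_PATH
> )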
>
>
Hi,

I think you really need both "--with-pmi=pmi2 --with-pm=slurm" parameters to 
the configure command when building mvapich2. So you need to fix whatever 
issue is preventing it from finding slurm/pmi2.h (I have a vague recollection 
that at some point there was a problem with the slurm makefiles not installing 
that file, or something like that).
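
A sketch of what I mean, assuming the paths from your mail (<slurm-src> is
your slurm source tree; the CPPFLAGS/LDFLAGS are standard autoconf knobs,
not mvapich2-specific):

  # rebuild and install the pmi2 client library from the slurm sources
  cd <slurm-src>/contribs/pmi2
  make && make install

  # then point the mvapich2 configure at that installation
  ./configure --prefix=/home/localsoft/mvapich2 --disable-mcast \
      --with-pm=slurm --with-pmi=pmi2 \
      CPPFLAGS=-I/home/localsoft/slurm/include \
      LDFLAGS=-L/home/localsoft/slurm/lib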

On another note, it doesn't make sense to put special files like pipes or 
sockets on a network filesystem. At best it does no harm, but there might be 
problems if several nodes want to create, say, a socket special file at the 
same shared path.
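
A minimal sketch, assuming SlurmdSpoolDir is what points at the shared
storage here (check with "scontrol show config | grep SlurmdSpoolDir"):

  # slurm.conf: keep slurmd state, credentials and pmi2 sockets node-local
  SlurmdSpoolDir=/var/spool/slurmd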


-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi
