Hi all,

I keep running into issues using Slurm + MVAPICH2. It seems that I cannot
configure Slurm and MVAPICH2 to work together correctly. In particular,
sbatch works correctly but srun does not. Maybe someone here can offer me
some guidance, as I suspect the error is an obvious one, but I just cannot
find it.

CONFIGURATION INFO:
I am using Slurm 17.02.0-0pre2 and MVAPICH2 2.2.
MVAPICH2 is compiled with "--disable-mcast --with-slurm=<my slurm location>"
(there is a note about this at the bottom of this mail).
Slurm is compiled with no special options. After compiling it, I also ran
"make && make install" in "contribs/pmi2/" (I read that suggestion somewhere).
Slurm is configured with "MpiDefault=pmi2" in slurm.conf.
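
To summarize the setup as commands: the MVAPICH2 configure line is the one
from the PS at the end of this mail, and the Slurm prefix is my assumption,
taken from the install paths that appear in the logs further down.

# MVAPICH2 2.2 (see PS)
./configure --prefix=/home/localsoft/mvapich2 --disable-mcast \
            --with-slurm=/home/localsoft/slurm
make && make install

# Slurm 17.02.0-0pre2, default configure options
# (prefix assumed from the paths in the logs below)
./configure --prefix=/home/localsoft/slurm
make && make install
# extra step to build/install the PMI2 client library
cd contribs/pmi2 && make && make install

# slurm.conf
MpiDefault=pmi2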

TESTS:
I am running a "helloWorldMPI" program that prints a hello world message
and the node name for each MPI task.
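
For reference, the program is essentially the textbook MPI hello world,
something along these lines (the exact wording of my version may differ
slightly); it is compiled with MVAPICH2's mpicc:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Roughly what helloWorldMPI does: report rank, size and host name. */
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    printf("Process %d of %d is on %s\n", rank, size, name);
    printf("Hello world from process %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}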

sbatch works perfectly:

$ sbatch -n 2 --tasks-per-node=2 --wrap 'mpiexec  ./helloWorldMPI'
Submitted batch job 750

$ more slurm-750.out
Process 0 of 2 is on acme12.ciemat.es
Hello world from process 0 of 2
Process 1 of 2 is on acme12.ciemat.es
Hello world from process 1 of 2

$ sbatch -n 2 --tasks-per-node=1 -p debug --wrap 'mpiexec  ./helloWorldMPI'
Submitted batch job 748

$ more slurm-748.out
Process 0 of 2 is on acme11.ciemat.es
Hello world from process 0 of 2
Process 1 of 2 is on acme12.ciemat.es
Hello world from process 1 of 2


srun, on the other hand, gives me trouble.
On a single node it works correctly:
$ srun -n 2 --tasks-per-node=2   ./helloWorldMPI
Process 0 of 2 is on acme11.ciemat.es
Hello world from process 0 of 2
Process 1 of 2 is on acme11.ciemat.es
Hello world from process 1 of 2

But when using more than one node, it fails. Below is the same run with a
lot of debugging output, in case it helps.

(note that the job IDs differ in places, as this mail is the result of
multiple submissions and copy/pastes)

$ srun -n 2 --tasks-per-node=1   ./helloWorldMPI
srun: error: mpi/pmi2: failed to send temp kvs to compute nodes
slurmstepd: error: *** STEP 753.0 ON acme11 CANCELLED AT
2016-11-17T10:19:47 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: acme11: task 0: Killed
srun: error: acme12: task 1: Killed


Slurmctld output:
slurmctld: debug2: Performing purge of old job records
slurmctld: debug2: Performing full system state save
slurmctld: debug3: Writing job id 753 to header record of job_state file
slurmctld: debug2: sched: Processing RPC: REQUEST_RESOURCE_ALLOCATION from
uid=500
slurmctld: debug3: JobDesc: user_id=500 job_id=N/A partition=(null)
name=helloWorldMPI
slurmctld: debug3:    cpus=2-4294967294 pn_min_cpus=-1 core_spec=-1
slurmctld: debug3:    Nodes=1-[4294967294] Sock/Node=65534 Core/Sock=65534
Thread/Core=65534
slurmctld: debug3:    pn_min_memory_job=18446744073709551615
pn_min_tmp_disk=-1
slurmctld: debug3:    immediate=0 features=(null) reservation=(null)
slurmctld: debug3:    req_nodes=(null) exc_nodes=(null) gres=(null)
slurmctld: debug3:    time_limit=-1--1 priority=-1 contiguous=0 shared=-1
slurmctld: debug3:    kill_on_node_fail=-1 script=(null)
slurmctld: debug3:    argv="./helloWorldMPI"
slurmctld: debug3:    stdin=(null) stdout=(null) stderr=(null)
slurmctld: debug3:    work_dir=/home/slurm/tests alloc_node:sid=acme31:11229
slurmctld: debug3:    power_flags=
slurmctld: debug3:    resp_host=172.17.31.165 alloc_resp_port=56804
other_port=33290
slurmctld: debug3:    dependency=(null) account=(null) qos=(null)
comment=(null)
slurmctld: debug3:    mail_type=0 mail_user=(null) nice=0 num_tasks=2
open_mode=0 overcommit=-1 acctg_freq=(null)
slurmctld: debug3:    network=(null) begin=Unknown cpus_per_task=-1
requeue=-1 licenses=(null)
slurmctld: debug3:    end_time= signal=0@0 wait_all_nodes=-1 cpu_freq=
slurmctld: debug3:    ntasks_per_node=1 ntasks_per_socket=-1
ntasks_per_core=-1
slurmctld: debug3:    mem_bind=65534:(null) plane_size:65534
slurmctld: debug3:    array_inx=(null)
slurmctld: debug3:    burst_buffer=(null)
slurmctld: debug3:    mcs_label=(null)
slurmctld: debug3:    deadline=Unknown
slurmctld: debug3:    bitflags=0 delay_boot=4294967294
slurmctld: debug3: User (null)(500) doesn't have a default account
slurmctld: debug3: User (null)(500) doesn't have a default account
slurmctld: debug3: found correct qos
slurmctld: debug3: before alteration asking for nodes 1-4294967294 cpus
2-4294967294
slurmctld: debug3: after alteration asking for nodes 1-4294967294 cpus
2-4294967294
slurmctld: debug2: found 8 usable nodes from config containing
acme[11-14,21-24]
slurmctld: debug3: _pick_best_nodes: job 754 idle_nodes 8 share_nodes 8
slurmctld: debug5: powercapping: checking job 754 : skipped, capping
disabled
slurmctld: debug2: sched: JobId=754 allocated resources:
NodeList=acme[11-12]
slurmctld: sched: _slurm_rpc_allocate_resources JobId=754
NodeList=acme[11-12] usec=1340
slurmctld: debug3: Writing job id 754 to header record of job_state file
slurmctld: debug2: _slurm_rpc_job_ready(754)=3 usec=4
slurmctld: debug3: StepDesc: user_id=500 job_id=754 node_count=2-2
cpu_count=2 num_tasks=2
slurmctld: debug3:    cpu_freq_gov=4294967294 cpu_freq_max=4294967294
cpu_freq_min=4294967294 relative=65534 task_dist=0x1 plane=1
slurmctld: debug3:    node_list=(null)  constraints=(null)
slurmctld: debug3:    host=acme31 port=36711 srun_pid=8887
name=helloWorldMPI network=(null) exclusive=0
slurmctld: debug3:    checkpoint-dir=/home/localsoft/slurm/checkpoint
checkpoint_int=0
slurmctld: debug3:    mem_per_node=0 resv_port_cnt=65534 immediate=0
no_kill=0
slurmctld: debug3:    overcommit=0 time_limit=0 gres=(null)
slurmctld: _pick_step_nodes: Configuration for job 754 is complete
slurmctld: debug3: step_layout cpus = 16 pos = 0
slurmctld: debug3: step_layout cpus = 16 pos = 1
slurmctld: debug:  laying out the 2 tasks on 2 hosts acme[11-12] dist 1
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Performing purge of old job records
slurmctld: debug:  sched: Running job scheduler
slurmctld: debug3: Writing job id 754 to header record of job_state file
slurmctld: debug2: Performing purge of old job records
slurmctld: debug2: Spawning RPC agent for msg_type SRUN_JOB_COMPLETE
slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_SIGNAL_TASKS
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Tree head got back 0 looking for 2
slurmctld: debug3: Tree sending to acme11
slurmctld: debug3: Tree sending to acme12
slurmctld: debug3: slurm_send_only_node_msg: sent 181
slurmctld: debug4: orig_timeout was 10000 we have 0 steps and a timeout of
10000
slurmctld: debug4: orig_timeout was 10000 we have 0 steps and a timeout of
10000
slurmctld: debug3: Writing job id 754 to header record of job_state file
slurmctld: debug2: Spawning RPC agent for msg_type SRUN_JOB_COMPLETE
slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_SIGNAL_TASKS
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Tree head got back 0 looking for 2
slurmctld: debug3: Tree sending to acme12
slurmctld: debug3: Tree sending to acme11
slurmctld: debug3: slurm_send_only_node_msg: sent 181
slurmctld: debug4: orig_timeout was 10000 we have 0 steps and a timeout of
10000
slurmctld: debug4: orig_timeout was 10000 we have 0 steps and a timeout of
10000
slurmctld: debug2: Tree head got back 1
slurmctld: debug2: Tree head got back 2
slurmctld: debug2: Tree head got back 1
slurmctld: debug2: Tree head got back 2
slurmctld: debug2: RPC to node acme12 failed, job not running
slurmctld: debug2: RPC to node acme11 failed, job not running
slurmctld: debug2: Processing RPC: REQUEST_COMPLETE_JOB_ALLOCATION from
uid=500, JobId=754 rc=9
slurmctld: job_complete: JobID=754 State=0x1 NodeCnt=2 WTERMSIG 9
slurmctld: debug2: Spawning RPC agent for msg_type SRUN_JOB_COMPLETE
slurmctld: debug2: Spawning RPC agent for msg_type SRUN_JOB_COMPLETE
slurmctld: debug2: got 1 threads to send out
slurmctld: debug3: User (null)(500) doesn't have a default account
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
slurmctld: job_complete: JobID=754 State=0x8003 NodeCnt=2 done
slurmctld: debug2: _slurm_rpc_complete_job_allocation: JobID=754
State=0x8003 NodeCnt=2
slurmctld: debug2: got 1 threads to send out
slurmctld: debug3: Tree sending to acme11
slurmctld: debug3: slurm_send_only_node_msg: sent 181
slurmctld: debug2: Tree head got back 0 looking for 2
slurmctld: debug3: Tree sending to acme12
slurmctld: debug3: slurm_send_only_node_msg: sent 181
slurmctld: debug4: orig_timeout was 10000 we have 0 steps and a timeout of
10000
slurmctld: debug4: orig_timeout was 10000 we have 0 steps and a timeout of
10000
slurmctld: debug2: node_did_resp acme12
slurmctld: debug2: node_did_resp acme11
slurmctld: debug2: node_did_resp acme12
slurmctld: debug2: node_did_resp acme11
slurmctld: debug2: Tree head got back 1
slurmctld: debug2: Tree head got back 2
slurmctld: debug2: node_did_resp acme11
slurmctld: debug2: node_did_resp acme12
slurmctld: debug2: Processing RPC: MESSAGE_EPILOG_COMPLETE uid=0
slurmctld: debug2: _slurm_rpc_epilog_complete: JobID=754 State=0x8003
NodeCnt=1 Node=acme12
slurmctld: debug2: Processing RPC: MESSAGE_EPILOG_COMPLETE uid=0
slurmctld: debug2: _slurm_rpc_epilog_complete: JobID=754 State=0x3
NodeCnt=0 Node=acme11
slurmctld: debug:  sched: Running job scheduler
slurmctld: debug3: Writing job id 754 to header record of job_state file
slurmctld: debug2: Performing purge of old job records



slurmd output (first node, acme11):
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6001
slurmd: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
slurmd: launch task 754.0 request from 500.1001@172.17.31.165 (port 38631)
slurmd: debug3: state for jobid 744: ctime:1479371687 revoked:0 expires:0
slurmd: debug3: state for jobid 745: ctime:1479371707 revoked:0 expires:0
slurmd: debug3: state for jobid 751: ctime:1479374214 revoked:0 expires:0
slurmd: debug3: state for jobid 752: ctime:1479374335 revoked:1479374340
expires:1479374340
slurmd: debug3: state for jobid 752: ctime:1479374335 revoked:0 expires:0
slurmd: debug3: state for jobid 753: ctime:1479374372 revoked:1479374387
expires:1479374387
slurmd: debug3: state for jobid 753: ctime:1479374372 revoked:0 expires:0
slurmd: debug:  Checking credential with 300 bytes of sig data
slurmd: debug:  task_p_slurmd_launch_request: 754.0 0
slurmd: debug:  Calling /home/localsoft/slurm/sbin/slurmstepd spank prolog
spank-prolog: debug:  Reading slurm.conf file:
/home/localsoft/slurm/etc/slurm.conf
spank-prolog: debug:  Running spank/prolog for jobid [754] uid [500]
spank-prolog: debug:  spank: opening plugin stack
/home/localsoft/slurm/etc/plugstack.conf
slurmd: _run_prolog: run job script took usec=11010
slurmd: _run_prolog: prolog with lock for job 754 ran for 0 seconds
slurmd: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
slurmd: debug3: slurmstepd rank 0 (acme11), parent rank -1 (NONE), children
1, depth 0, max_depth 1
slurmd: debug3: _send_slurmstepd_init: call to getpwuid_r
slurmd: debug3: _send_slurmstepd_init: return from getpwuid_r
slurmd: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
slurmd: debug:  task_p_slurmd_reserve_resources: 754 0
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6004
slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd: debug:  _rpc_signal_tasks: sending signal 995 to step 754.0 flag 0
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 5029
slurmd: debug3: Entering _rpc_forward_data, address:
/home/localsoft/slurm/spool//sock.pmi2.754.0, len: 66
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6004
slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd: debug:  _rpc_signal_tasks: sending signal 9 to step 754.0 flag 0
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6004
slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd: debug:  _rpc_signal_tasks: sending signal 9 to step 754.0 flag 0
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 5016
slurmd: debug3: Entering _rpc_step_complete
slurmd: debug:  Entering stepd_completion for 754.0, range_first = 1,
range_last = 1
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6011
slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB
slurmd: debug:  _rpc_terminate_job, uid = 500
slurmd: debug:  task_p_slurmd_release_resources: 754
slurmd: debug3: state for jobid 744: ctime:1479371687 revoked:0 expires:0
slurmd: debug3: state for jobid 745: ctime:1479371707 revoked:0 expires:0
slurmd: debug3: state for jobid 751: ctime:1479374214 revoked:0 expires:0
slurmd: debug3: state for jobid 752: ctime:1479374335 revoked:1479374340
expires:1479374340
slurmd: debug3: state for jobid 752: ctime:1479374335 revoked:0 expires:0
slurmd: debug3: state for jobid 753: ctime:1479374372 revoked:1479374387
expires:1479374387
slurmd: debug3: state for jobid 753: ctime:1479374372 revoked:0 expires:0
slurmd: debug3: state for jobid 754: ctime:1479374425 revoked:0 expires:0
slurmd: debug3: state for jobid 754: ctime:1479374425 revoked:0 expires:0
slurmd: debug:  credential for job 754 revoked
slurmd: debug2: No steps in jobid 754 to send signal 18
slurmd: debug2: No steps in jobid 754 to send signal 15
slurmd: debug4: sent SUCCESS
slurmd: debug2: set revoke expiration for jobid 754 to 1479374560 UTS
slurmd: debug4: unable to create link for
/home/localsoft/slurm/spool//cred_state
-> /home/localsoft/slurm/spool//cred_state.old: No such file or directory
slurmd: debug4: unable to create link for
/home/localsoft/slurm/spool//cred_state.new
-> /home/localsoft/slurm/spool//cred_state: No such file or directory
slurmd: debug:  Waiting for job 754's prolog to complete
slurmd: debug:  Finished wait for job 754's prolog to complete
slurmd: debug:  Calling /home/localsoft/slurm/sbin/slurmstepd spank epilog
spank-epilog: debug:  Reading slurm.conf file:
/home/localsoft/slurm/etc/slurm.conf
spank-epilog: debug:  Running spank/epilog for jobid [754] uid [500]
spank-epilog: debug:  spank: opening plugin stack
/home/localsoft/slurm/etc/plugstack.conf
slurmd: debug:  completed epilog for jobid 754
slurmd: debug3: slurm_send_only_controller_msg: sent 192
slurmd: debug:  Job 754: sent epilog complete msg: rc = 0



slurmd output (second node, acme12):
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6001
slurmd: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
slurmd: launch task 754.0 request from 500.1001@172.17.31.165 (port 47784)
slurmd: debug3: state for jobid 744: ctime:1479371687 revoked:0 expires:0
slurmd: debug3: state for jobid 745: ctime:1479371707 revoked:0 expires:0
slurmd: debug3: state for jobid 746: ctime:1479371733 revoked:0 expires:0
slurmd: debug3: state for jobid 747: ctime:1479371785 revoked:0 expires:0
slurmd: debug3: state for jobid 748: ctime:1479374028 revoked:0 expires:0
slurmd: debug3: state for jobid 753: ctime:1479374372 revoked:1479374387
expires:1479374387
slurmd: debug3: state for jobid 753: ctime:1479374372 revoked:0 expires:0
slurmd: debug:  Checking credential with 300 bytes of sig data
slurmd: debug:  task_p_slurmd_launch_request: 754.0 1
slurmd: debug:  Calling /home/localsoft/slurm/sbin/slurmstepd spank prolog
spank-prolog: debug:  Reading slurm.conf file:
/home/localsoft/slurm/etc/slurm.conf
spank-prolog: debug:  Running spank/prolog for jobid [754] uid [500]
spank-prolog: debug:  spank: opening plugin stack
/home/localsoft/slurm/etc/plugstack.conf
slurmd: _run_prolog: run job script took usec=10434
slurmd: _run_prolog: prolog with lock for job 754 ran for 0 seconds
slurmd: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
slurmd: debug3: slurmstepd rank 1 (acme12), parent rank 0 (acme11),
children 0, depth 1, max_depth 1
slurmd: debug3: _send_slurmstepd_init: call to getpwuid_r
slurmd: debug3: _send_slurmstepd_init: return from getpwuid_r
slurmd: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
slurmd: debug4: unable to create link for
/home/localsoft/slurm/spool//cred_state
-> /home/localsoft/slurm/spool//cred_state.old: No such file or directory
slurmd: debug:  task_p_slurmd_reserve_resources: 754 1
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6004
slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd: debug:  _rpc_signal_tasks: sending signal 995 to step 754.0 flag 0
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 5029
slurmd: debug3: Entering _rpc_forward_data, address:
/home/localsoft/slurm/spool//sock.pmi2.754.0, len: 66
slurmd: debug2: failed connecting to specified socket
'/home/localsoft/slurm/spool//sock.pmi2.754.0': Stale file handle
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 5029
slurmd: debug3: Entering _rpc_forward_data, address:
/home/localsoft/slurm/spool//sock.pmi2.754.0, len: 66
slurmd: debug2: failed connecting to specified socket
'/home/localsoft/slurm/spool//sock.pmi2.754.0': Stale file handle
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 5029
slurmd: debug3: Entering _rpc_forward_data, address:
/home/localsoft/slurm/spool//sock.pmi2.754.0, len: 66
slurmd: debug2: failed connecting to specified socket
'/home/localsoft/slurm/spool//sock.pmi2.754.0': Stale file handle
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 5029
slurmd: debug3: Entering _rpc_forward_data, address:
/home/localsoft/slurm/spool//sock.pmi2.754.0, len: 66
slurmd: debug2: failed connecting to specified socket
'/home/localsoft/slurm/spool//sock.pmi2.754.0': Stale file handle
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 5029
slurmd: debug3: Entering _rpc_forward_data, address:
/home/localsoft/slurm/spool//sock.pmi2.754.0, len: 66
slurmd: debug2: failed connecting to specified socket
'/home/localsoft/slurm/spool//sock.pmi2.754.0': Stale file handle
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6004
slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd: debug:  _rpc_signal_tasks: sending signal 9 to step 754.0 flag 0
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6004
slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd: debug:  _rpc_signal_tasks: sending signal 9 to step 754.0 flag 0
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6011
slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB
slurmd: debug:  _rpc_terminate_job, uid = 500
slurmd: debug:  task_p_slurmd_release_resources: 754
slurmd: debug3: state for jobid 744: ctime:1479371687 revoked:0 expires:0
slurmd: debug3: state for jobid 745: ctime:1479371707 revoked:0 expires:0
slurmd: debug3: state for jobid 746: ctime:1479371733 revoked:0 expires:0
slurmd: debug3: state for jobid 747: ctime:1479371785 revoked:0 expires:0
slurmd: debug3: state for jobid 748: ctime:1479374028 revoked:0 expires:0
slurmd: debug3: state for jobid 753: ctime:1479374372 revoked:1479374387
expires:1479374387
slurmd: debug3: state for jobid 753: ctime:1479374372 revoked:0 expires:0
slurmd: debug3: state for jobid 754: ctime:1479374425 revoked:0 expires:0
slurmd: debug3: state for jobid 754: ctime:1479374425 revoked:0 expires:0
slurmd: debug4: unable to create link for
/home/localsoft/slurm/spool//cred_state
-> /home/localsoft/slurm/spool//cred_state.old: File exists
slurmd: debug4: unable to create link for
/home/localsoft/slurm/spool//cred_state.new
-> /home/localsoft/slurm/spool//cred_state: File exists
slurmd: debug:  credential for job 754 revoked
slurmd: debug2: No steps in jobid 754 to send signal 18
slurmd: debug2: No steps in jobid 754 to send signal 15
slurmd: debug4: sent SUCCESS
slurmd: debug2: set revoke expiration for jobid 754 to 1479374560 UTS
slurmd: debug:  Waiting for job 754's prolog to complete
slurmd: debug:  Finished wait for job 754's prolog to complete
slurmd: debug:  Calling /home/localsoft/slurm/sbin/slurmstepd spank epilog
spank-epilog: debug:  Reading slurm.conf file:
/home/localsoft/slurm/etc/slurm.conf
spank-epilog: debug:  Running spank/epilog for jobid [754] uid [500]
spank-epilog: debug:  spank: opening plugin stack
/home/localsoft/slurm/etc/plugstack.conf
slurmd: debug:  completed epilog for jobid 754
slurmd: debug3: slurm_send_only_controller_msg: sent 192
slurmd: debug:  Job 754: sent epilog complete msg: rc = 0


As you can see, the problem seems to be in these lines (from the second node):
slurmd: debug2: got this type of message 5029
slurmd: debug3: Entering _rpc_forward_data, address:
/home/localsoft/slurm/spool//sock.pmi2.754.0, len: 66
slurmd: debug2: failed connecting to specified socket
'/home/localsoft/slurm/spool//sock.pmi2.754.0': Stale file handle
slurmd: debug3: in the service_connection


I have checked that these files exist in the shared storage and are
accessible from the complaining node. They are, however, empty. Is this
normal? What should I expect?

$ ssh acme11 'ls -plah /home/localsoft/slurm/spool/'
total 160K
drwxr-xr-x  2 slurm slurm 4,0K nov 17 10:26 ./
drwxr-xr-x 12 slurm slurm 4,0K nov 16 16:20 ../
srwxrwxrwx  1 root  root     0 nov 17 10:26 acme11_755.0
srwxrwxrwx  1 root  root     0 nov 17 10:26 acme12_755.0
-rw-------  1 root  root   284 nov 17 10:26 cred_state.old
-rw-------  1 slurm slurm 141K nov 16 14:24 slurmdbd.log
-rw-r--r--  1 slurm slurm    5 nov 16 14:24 slurmdbd.pid
srwxr-xr-x  1 root  root     0 nov 17 10:26 sock.pmi2.755.0


So any ideas?

Thanks for your help,

Manuel


PS: About the MVAPICH2 compilation.

I ran quite a few tests, and I ended up configuring with:
./configure --prefix=/home/localsoft/mvapich2 --disable-mcast
--with-slurm=/home/localsoft/slurm

Before that I tried the instructions at
http://slurm.schedmd.com/mpi_guide.html#mvapich2 but it fails:
./configure --prefix=/home/localsoft/mvapich2 --disable-mcast
 --with-pmi=pmi2  --with-pm=slurm
(...)
checking for slurm/pmi2.h... no
configure: error: could not find slurm/pmi2.h.  Configure aborted

I also tried
./configure --prefix=/home/localsoft/mvapich2 --disable-mcast
--with-slurm=/home/localsoft/slurm --with-pmi=pmi2  --with-pm=slurm
(...)
checking whether we are cross compiling... configure: error: in
`/root/mvapich2-2.2/src/mpi/romio':
configure: error: cannot run C compiled programs.
If you meant to cross compile, use `--host'.
See `config.log' for more details
configure: error: src/mpi/romio configure failed
