Hi all, I keep running into issues using Slurm + MVAPICH2: it seems that I cannot configure the two to work together correctly. In particular, sbatch works fine but srun does not. Maybe someone here can give me some guidance, as I suspect the error is an obvious one that I just cannot find.
CONFIGURATION INFO:

I am using Slurm 17.02.0-0pre2 and MVAPICH2 2.2.
- MVAPICH2 is compiled with "--disable-mcast --with-slurm=<my slurm location>" <--- there is a note about this at the bottom of this mail.
- Slurm is compiled with no special options. After compilation, I executed "make && make install" in "contribs/pmi2/" (I read about this step somewhere).
- Slurm is configured with "MpiDefault=pmi2" in slurm.conf.
(A condensed transcript of these steps appears right after the failing srun output below.)

TESTS:

I am running a "helloWorldMPI" program that prints a hello world message and the node name for each MPI task.

sbatch works perfectly:

$ sbatch -n 2 --tasks-per-node=2 --wrap 'mpiexec ./helloWorldMPI'
Submitted batch job 750
$ more slurm-750.out
Process 0 of 2 is on acme12.ciemat.es
Hello world from process 0 of 2
Process 1 of 2 is on acme12.ciemat.es
Hello world from process 1 of 2

$ sbatch -n 2 --tasks-per-node=1 -p debug --wrap 'mpiexec ./helloWorldMPI'
Submitted batch job 748
$ more slurm-748.out
Process 0 of 2 is on acme11.ciemat.es
Hello world from process 0 of 2
Process 1 of 2 is on acme12.ciemat.es
Hello world from process 1 of 2

However, srun fails. On a single node it works correctly:

$ srun -n 2 --tasks-per-node=2 ./helloWorldMPI
Process 0 of 2 is on acme11.ciemat.es
Hello world from process 0 of 2
Process 1 of 2 is on acme11.ciemat.es
Hello world from process 1 of 2

But when using more than one node, it fails. Below is the experiment with a lot of debugging info, in case it helps (note that the job IDs differ in places, as this mail is the result of multiple submissions and copy/pastes).

$ srun -n 2 --tasks-per-node=1 ./helloWorldMPI
srun: error: mpi/pmi2: failed to send temp kvs to compute nodes
slurmstepd: error: *** STEP 753.0 ON acme11 CANCELLED AT 2016-11-17T10:19:47 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: acme11: task 0: Killed
srun: error: acme12: task 1: Killed
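Before the daemon logs, and since it may help to see the Slurm side of the setup in one place, this is roughly what I did, reconstructed from memory (so take the exact source-tree paths and the --prefix option with a grain of salt; as far as I understand, the contribs/pmi2 step is what installs the PMI2 client library under the Slurm prefix):

$ cd <slurm-17.02.0-0pre2 source tree>
$ ./configure --prefix=/home/localsoft/slurm     # "no special options" beyond the prefix
$ make && make install
$ cd contribs/pmi2
$ make && make install
$ grep MpiDefault /home/localsoft/slurm/etc/slurm.conf
MpiDefault=pmi2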
Slurmctld output:

slurmctld: debug2: Performing purge of old job records
slurmctld: debug2: Performing full system state save
slurmctld: debug3: Writing job id 753 to header record of job_state file
slurmctld: debug2: sched: Processing RPC: REQUEST_RESOURCE_ALLOCATION from uid=500
slurmctld: debug3: JobDesc: user_id=500 job_id=N/A partition=(null) name=helloWorldMPI
slurmctld: debug3: cpus=2-4294967294 pn_min_cpus=-1 core_spec=-1
slurmctld: debug3: Nodes=1-[4294967294] Sock/Node=65534 Core/Sock=65534 Thread/Core=65534
slurmctld: debug3: pn_min_memory_job=18446744073709551615 pn_min_tmp_disk=-1
slurmctld: debug3: immediate=0 features=(null) reservation=(null)
slurmctld: debug3: req_nodes=(null) exc_nodes=(null) gres=(null)
slurmctld: debug3: time_limit=-1--1 priority=-1 contiguous=0 shared=-1
slurmctld: debug3: kill_on_node_fail=-1 script=(null)
slurmctld: debug3: argv="./helloWorldMPI"
slurmctld: debug3: stdin=(null) stdout=(null) stderr=(null)
slurmctld: debug3: work_dir=/home/slurm/tests alloc_node:sid=acme31:11229
slurmctld: debug3: power_flags=
slurmctld: debug3: resp_host=172.17.31.165 alloc_resp_port=56804 other_port=33290
slurmctld: debug3: dependency=(null) account=(null) qos=(null) comment=(null)
slurmctld: debug3: mail_type=0 mail_user=(null) nice=0 num_tasks=2 open_mode=0 overcommit=-1 acctg_freq=(null)
slurmctld: debug3: network=(null) begin=Unknown cpus_per_task=-1 requeue=-1 licenses=(null)
slurmctld: debug3: end_time= signal=0@0 wait_all_nodes=-1 cpu_freq=
slurmctld: debug3: ntasks_per_node=1 ntasks_per_socket=-1 ntasks_per_core=-1
slurmctld: debug3: mem_bind=65534:(null) plane_size:65534
slurmctld: debug3: array_inx=(null)
slurmctld: debug3: burst_buffer=(null)
slurmctld: debug3: mcs_label=(null)
slurmctld: debug3: deadline=Unknown
slurmctld: debug3: bitflags=0 delay_boot=4294967294
slurmctld: debug3: User (null)(500) doesn't have a default account
slurmctld: debug3: User (null)(500) doesn't have a default account
slurmctld: debug3: found correct qos
slurmctld: debug3: before alteration asking for nodes 1-4294967294 cpus 2-4294967294
slurmctld: debug3: after alteration asking for nodes 1-4294967294 cpus 2-4294967294
slurmctld: debug2: found 8 usable nodes from config containing acme[11-14,21-24]
slurmctld: debug3: _pick_best_nodes: job 754 idle_nodes 8 share_nodes 8
slurmctld: debug5: powercapping: checking job 754 : skipped, capping disabled
slurmctld: debug2: sched: JobId=754 allocated resources: NodeList=acme[11-12]
slurmctld: sched: _slurm_rpc_allocate_resources JobId=754 NodeList=acme[11-12] usec=1340
slurmctld: debug3: Writing job id 754 to header record of job_state file
slurmctld: debug2: _slurm_rpc_job_ready(754)=3 usec=4
slurmctld: debug3: StepDesc: user_id=500 job_id=754 node_count=2-2 cpu_count=2 num_tasks=2
slurmctld: debug3: cpu_freq_gov=4294967294 cpu_freq_max=4294967294 cpu_freq_min=4294967294 relative=65534 task_dist=0x1 plane=1
slurmctld: debug3: node_list=(null) constraints=(null)
slurmctld: debug3: host=acme31 port=36711 srun_pid=8887 name=helloWorldMPI network=(null) exclusive=0
slurmctld: debug3: checkpoint-dir=/home/localsoft/slurm/checkpoint checkpoint_int=0
slurmctld: debug3: mem_per_node=0 resv_port_cnt=65534 immediate=0 no_kill=0
slurmctld: debug3: overcommit=0 time_limit=0 gres=(null)
slurmctld: _pick_step_nodes: Configuration for job 754 is complete
slurmctld: debug3: step_layout cpus = 16 pos = 0
slurmctld: debug3: step_layout cpus = 16 pos = 1
slurmctld: debug: laying out the 2 tasks on 2 hosts acme[11-12] dist 1
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Performing purge of old job records
slurmctld: debug: sched: Running job scheduler
slurmctld: debug3: Writing job id 754 to header record of job_state file
slurmctld: debug2: Performing purge of old job records
slurmctld: debug2: Spawning RPC agent for msg_type SRUN_JOB_COMPLETE
slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_SIGNAL_TASKS
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Tree head got back 0 looking for 2
slurmctld: debug3: Tree sending to acme11
slurmctld: debug3: Tree sending to acme12
slurmctld: debug3: slurm_send_only_node_msg: sent 181
slurmctld: debug4: orig_timeout was 10000 we have 0 steps and a timeout of 10000
slurmctld: debug4: orig_timeout was 10000 we have 0 steps and a timeout of 10000
slurmctld: debug3: Writing job id 754 to header record of job_state file
slurmctld: debug2: Spawning RPC agent for msg_type SRUN_JOB_COMPLETE
slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_SIGNAL_TASKS
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Tree head got back 0 looking for 2
slurmctld: debug3: Tree sending to acme12
slurmctld: debug3: Tree sending to acme11
slurmctld: debug3: slurm_send_only_node_msg: sent 181
slurmctld: debug4: orig_timeout was 10000 we have 0 steps and a timeout of 10000
slurmctld: debug4: orig_timeout was 10000 we have 0 steps and a timeout of 10000
slurmctld: debug2: Tree head got back 1
slurmctld: debug2: Tree head got back 2
slurmctld: debug2: Tree head got back 1
slurmctld: debug2: Tree head got back 2
slurmctld: debug2: RPC to node acme12 failed, job not running
slurmctld: debug2: RPC to node acme11 failed, job not running
slurmctld: debug2: Processing RPC: REQUEST_COMPLETE_JOB_ALLOCATION from uid=500, JobId=754 rc=9
slurmctld: job_complete: JobID=754 State=0x1 NodeCnt=2 WTERMSIG 9
slurmctld: debug2: Spawning RPC agent for msg_type SRUN_JOB_COMPLETE
slurmctld: debug2: Spawning RPC agent for msg_type SRUN_JOB_COMPLETE
slurmctld: debug2: got 1 threads to send out
slurmctld: debug3: User (null)(500) doesn't have a default account
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
slurmctld: job_complete: JobID=754 State=0x8003 NodeCnt=2 done
slurmctld: debug2: _slurm_rpc_complete_job_allocation: JobID=754 State=0x8003 NodeCnt=2
slurmctld: debug2: got 1 threads to send out
slurmctld: debug3: Tree sending to acme11
slurmctld: debug3: slurm_send_only_node_msg: sent 181
slurmctld: debug2: Tree head got back 0 looking for 2
slurmctld: debug3: Tree sending to acme12
slurmctld: debug3: slurm_send_only_node_msg: sent 181
slurmctld: debug4: orig_timeout was 10000 we have 0 steps and a timeout of 10000
slurmctld: debug4: orig_timeout was 10000 we have 0 steps and a timeout of 10000
slurmctld: debug2: node_did_resp acme12
slurmctld: debug2: node_did_resp acme11
slurmctld: debug2: node_did_resp acme12
slurmctld: debug2: node_did_resp acme11
slurmctld: debug2: Tree head got back 1
slurmctld: debug2: Tree head got back 2
slurmctld: debug2: node_did_resp acme11
slurmctld: debug2: node_did_resp acme12
slurmctld: debug2: Processing RPC: MESSAGE_EPILOG_COMPLETE uid=0
slurmctld: debug2: _slurm_rpc_epilog_complete: JobID=754 State=0x8003 NodeCnt=1 Node=acme12
slurmctld: debug2: Processing RPC: MESSAGE_EPILOG_COMPLETE uid=0
slurmctld: debug2: _slurm_rpc_epilog_complete: JobID=754 State=0x3 NodeCnt=0 Node=acme11
slurmctld: debug: sched: Running job scheduler
slurmctld: debug3: Writing job id 754 to header record of job_state file
slurmctld: debug2: Performing purge of old job records

slurmd (one node):

slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6001
slurmd: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
slurmd: launch task 754.0 request from 500.1001@172.17.31.165 (port 38631)
slurmd: debug3: state for jobid 744: ctime:1479371687 revoked:0 expires:0
slurmd: debug3: state for jobid 745: ctime:1479371707 revoked:0 expires:0
slurmd: debug3: state for jobid 751: ctime:1479374214 revoked:0 expires:0
slurmd: debug3: state for jobid 752: ctime:1479374335 revoked:1479374340 expires:1479374340
slurmd: debug3: state for jobid 752: ctime:1479374335 revoked:0 expires:0
slurmd: debug3: state for jobid 753: ctime:1479374372 revoked:1479374387 expires:1479374387
slurmd: debug3: state for jobid 753: ctime:1479374372 revoked:0 expires:0
slurmd: debug: Checking credential with 300 bytes of sig data
slurmd: debug: task_p_slurmd_launch_request: 754.0 0
slurmd: debug: Calling /home/localsoft/slurm/sbin/slurmstepd spank prolog
spank-prolog: debug: Reading slurm.conf file: /home/localsoft/slurm/etc/slurm.conf
spank-prolog: debug: Running spank/prolog for jobid [754] uid [500]
spank-prolog: debug: spank: opening plugin stack /home/localsoft/slurm/etc/plugstack.conf
slurmd: _run_prolog: run job script took usec=11010
slurmd: _run_prolog: prolog with lock for job 754 ran for 0 seconds
slurmd: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
slurmd: debug3: slurmstepd rank 0 (acme11), parent rank -1 (NONE), children 1, depth 0, max_depth 1
slurmd: debug3: _send_slurmstepd_init: call to getpwuid_r
slurmd: debug3: _send_slurmstepd_init: return from getpwuid_r
slurmd: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
slurmd: debug: task_p_slurmd_reserve_resources: 754 0
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6004
slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd: debug: _rpc_signal_tasks: sending signal 995 to step 754.0 flag 0
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 5029
slurmd: debug3: Entering _rpc_forward_data, address: /home/localsoft/slurm/spool//sock.pmi2.754.0, len: 66
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6004
slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd: debug: _rpc_signal_tasks: sending signal 9 to step 754.0 flag 0
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6004
slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd: debug: _rpc_signal_tasks: sending signal 9 to step 754.0 flag 0
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 5016
slurmd: debug3: Entering _rpc_step_complete
slurmd: debug: Entering stepd_completion for 754.0, range_first = 1, range_last = 1
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6011
slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB
slurmd: debug: _rpc_terminate_job, uid = 500
slurmd: debug: task_p_slurmd_release_resources: 754
slurmd: debug3: state for jobid 744: ctime:1479371687 revoked:0 expires:0
slurmd: debug3: state for jobid 745: ctime:1479371707 revoked:0 expires:0
slurmd: debug3: state for jobid 751: ctime:1479374214 revoked:0 expires:0
slurmd: debug3: state for jobid 752: ctime:1479374335 revoked:1479374340 expires:1479374340
slurmd: debug3: state for jobid 752: ctime:1479374335 revoked:0 expires:0
slurmd: debug3: state for jobid 753: ctime:1479374372 revoked:1479374387 expires:1479374387
slurmd: debug3: state for jobid 753: ctime:1479374372 revoked:0 expires:0
slurmd: debug3: state for jobid 754: ctime:1479374425 revoked:0 expires:0
slurmd: debug3: state for jobid 754: ctime:1479374425 revoked:0 expires:0
slurmd: debug: credential for job 754 revoked
slurmd: debug2: No steps in jobid 754 to send signal 18
slurmd: debug2: No steps in jobid 754 to send signal 15
slurmd: debug4: sent SUCCESS
slurmd: debug2: set revoke expiration for jobid 754 to 1479374560 UTS
slurmd: debug4: unable to create link for /home/localsoft/slurm/spool//cred_state -> /home/localsoft/slurm/spool//cred_state.old: No such file or directory
slurmd: debug4: unable to create link for /home/localsoft/slurm/spool//cred_state.new -> /home/localsoft/slurm/spool//cred_state: No such file or directory
slurmd: debug: Waiting for job 754's prolog to complete
slurmd: debug: Finished wait for job 754's prolog to complete
slurmd: debug: Calling /home/localsoft/slurm/sbin/slurmstepd spank epilog
spank-epilog: debug: Reading slurm.conf file: /home/localsoft/slurm/etc/slurm.conf
spank-epilog: debug: Running spank/epilog for jobid [754] uid [500]
spank-epilog: debug: spank: opening plugin stack /home/localsoft/slurm/etc/plugstack.conf
slurmd: debug: completed epilog for jobid 754
slurmd: debug3: slurm_send_only_controller_msg: sent 192
slurmd: debug: Job 754: sent epilog complete msg: rc = 0

slurmd (other node):

slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6001
slurmd: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
slurmd: launch task 754.0 request from 500.1001@172.17.31.165 (port 47784)
slurmd: debug3: state for jobid 744: ctime:1479371687 revoked:0 expires:0
slurmd: debug3: state for jobid 745: ctime:1479371707 revoked:0 expires:0
slurmd: debug3: state for jobid 746: ctime:1479371733 revoked:0 expires:0
slurmd: debug3: state for jobid 747: ctime:1479371785 revoked:0 expires:0
slurmd: debug3: state for jobid 748: ctime:1479374028 revoked:0 expires:0
slurmd: debug3: state for jobid 753: ctime:1479374372 revoked:1479374387 expires:1479374387
slurmd: debug3: state for jobid 753: ctime:1479374372 revoked:0 expires:0
slurmd: debug: Checking credential with 300 bytes of sig data
slurmd: debug: task_p_slurmd_launch_request: 754.0 1
slurmd: debug: Calling /home/localsoft/slurm/sbin/slurmstepd spank prolog
spank-prolog: debug: Reading slurm.conf file: /home/localsoft/slurm/etc/slurm.conf
spank-prolog: debug: Running spank/prolog for jobid [754] uid [500]
spank-prolog: debug: spank: opening plugin stack /home/localsoft/slurm/etc/plugstack.conf
slurmd: _run_prolog: run job script took usec=10434
slurmd: _run_prolog: prolog with lock for job 754 ran for 0 seconds
slurmd: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
slurmd: debug3: slurmstepd rank 1 (acme12), parent rank 0 (acme11), children 0, depth 1, max_depth 1
slurmd: debug3: _send_slurmstepd_init: call to getpwuid_r
slurmd: debug3: _send_slurmstepd_init: return from getpwuid_r
slurmd: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
slurmd: debug4: unable to create link for /home/localsoft/slurm/spool//cred_state -> /home/localsoft/slurm/spool//cred_state.old: No such file or directory
slurmd: debug: task_p_slurmd_reserve_resources: 754 1
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6004
slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd: debug: _rpc_signal_tasks: sending signal 995 to step 754.0 flag 0
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 5029
slurmd: debug3: Entering _rpc_forward_data, address: /home/localsoft/slurm/spool//sock.pmi2.754.0, len: 66
slurmd: debug2: failed connecting to specified socket '/home/localsoft/slurm/spool//sock.pmi2.754.0': Stale file handle
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 5029
slurmd: debug3: Entering _rpc_forward_data, address: /home/localsoft/slurm/spool//sock.pmi2.754.0, len: 66
slurmd: debug2: failed connecting to specified socket '/home/localsoft/slurm/spool//sock.pmi2.754.0': Stale file handle
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 5029
slurmd: debug3: Entering _rpc_forward_data, address: /home/localsoft/slurm/spool//sock.pmi2.754.0, len: 66
slurmd: debug2: failed connecting to specified socket '/home/localsoft/slurm/spool//sock.pmi2.754.0': Stale file handle
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 5029
slurmd: debug3: Entering _rpc_forward_data, address: /home/localsoft/slurm/spool//sock.pmi2.754.0, len: 66
slurmd: debug2: failed connecting to specified socket '/home/localsoft/slurm/spool//sock.pmi2.754.0': Stale file handle
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 5029
slurmd: debug3: Entering _rpc_forward_data, address: /home/localsoft/slurm/spool//sock.pmi2.754.0, len: 66
slurmd: debug2: failed connecting to specified socket '/home/localsoft/slurm/spool//sock.pmi2.754.0': Stale file handle
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6004
slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd: debug: _rpc_signal_tasks: sending signal 9 to step 754.0 flag 0
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6004
slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd: debug: _rpc_signal_tasks: sending signal 9 to step 754.0 flag 0
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6011
slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB
slurmd: debug: _rpc_terminate_job, uid = 500
slurmd: debug: task_p_slurmd_release_resources: 754
slurmd: debug3: state for jobid 744: ctime:1479371687 revoked:0 expires:0
slurmd: debug3: state for jobid 745: ctime:1479371707 revoked:0 expires:0
slurmd: debug3: state for jobid 746: ctime:1479371733 revoked:0 expires:0
slurmd: debug3: state for jobid 747: ctime:1479371785 revoked:0 expires:0
slurmd: debug3: state for jobid 748: ctime:1479374028 revoked:0 expires:0
slurmd: debug3: state for jobid 753: ctime:1479374372 revoked:1479374387 expires:1479374387
slurmd: debug3: state for jobid 753: ctime:1479374372 revoked:0 expires:0
slurmd: debug3: state for jobid 754: ctime:1479374425 revoked:0 expires:0
slurmd: debug3: state for jobid 754: ctime:1479374425 revoked:0 expires:0
slurmd: debug4: unable to create link for /home/localsoft/slurm/spool//cred_state -> /home/localsoft/slurm/spool//cred_state.old: File exists
slurmd: debug4: unable to create link for /home/localsoft/slurm/spool//cred_state.new -> /home/localsoft/slurm/spool//cred_state: File exists
slurmd: debug: credential for job 754 revoked
slurmd: debug2: No steps in jobid 754 to send signal 18
slurmd: debug2: No steps in jobid 754 to send signal 15
slurmd: debug4: sent SUCCESS
slurmd: debug2: set revoke expiration for jobid 754 to 1479374560 UTS
slurmd: debug: Waiting for job 754's prolog to complete
slurmd: debug: Finished wait for job 754's prolog to complete
slurmd: debug: Calling /home/localsoft/slurm/sbin/slurmstepd spank epilog
spank-epilog: debug: Reading slurm.conf file: /home/localsoft/slurm/etc/slurm.conf
spank-epilog: debug: Running spank/epilog for jobid [754] uid [500]
spank-epilog: debug: spank: opening plugin stack /home/localsoft/slurm/etc/plugstack.conf
slurmd: debug: completed epilog for jobid 754
slurmd: debug3: slurm_send_only_controller_msg: sent 192
slurmd: debug: Job 754: sent epilog complete msg: rc = 0
As you can see, the problem seems to be these lines:

slurmd: debug2: got this type of message 5029
slurmd: debug3: Entering _rpc_forward_data, address: /home/localsoft/slurm/spool//sock.pmi2.754.0, len: 66
slurmd: debug2: failed connecting to specified socket '/home/localsoft/slurm/spool//sock.pmi2.754.0': Stale file handle
slurmd: debug3: in the service_connection

I have checked that these files exist in the shared storage and are accessible by the node complaining. They are, however, empty. Is this normal? What should I expect?

$ ssh acme11 'ls -plah /home/localsoft/slurm/spool/'
total 160K
drwxr-xr-x  2 slurm slurm 4,0K nov 17 10:26 ./
drwxr-xr-x 12 slurm slurm 4,0K nov 16 16:20 ../
srwxrwxrwx  1 root  root     0 nov 17 10:26 acme11_755.0
srwxrwxrwx  1 root  root     0 nov 17 10:26 acme12_755.0
-rw-------  1 root  root   284 nov 17 10:26 cred_state.old
-rw-------  1 slurm slurm 141K nov 16 14:24 slurmdbd.log
-rw-r--r--  1 slurm slurm    5 nov 16 14:24 slurmdbd.pid
srwxr-xr-x  1 root  root     0 nov 17 10:26 sock.pmi2.755.0

So, any ideas?

Thanks for your help,
Manuel

PS: About the MVAPICH2 compilation. I made quite a few tests, and I ended up compiling with:

./configure --prefix=/home/localsoft/mvapich2 --disable-mcast --with-slurm=/home/localsoft/slurm

Before that I tried the instructions in http://slurm.schedmd.com/mpi_guide.html#mvapich2, but it fails:

./configure --prefix=/home/localsoft/mvapich2 --disable-mcast --with-pmi=pmi2 --with-pm=slurm
(...)
checking for slurm/pmi2.h... no
configure: error: could not find slurm/pmi2.h. Configure aborted

I also tried:

./configure --prefix=/home/localsoft/mvapich2 --disable-mcast --with-slurm=/home/localsoft/slurm --with-pmi=pmi2 --with-pm=slurm
(...)
checking whether we are cross compiling... configure: error: in `/root/mvapich2-2.2/src/mpi/romio':
configure: error: cannot run C compiled programs.
If you meant to cross compile, use `--host'.
See `config.log' for more details
configure: error: src/mpi/romio configure failed
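For completeness, the MVAPICH2 installation actually used in the tests above came from the first configure line quoted in this PS plus the standard build and install steps; roughly (the make commands are from memory, nothing unusual was added):

$ cd /root/mvapich2-2.2
$ ./configure --prefix=/home/localsoft/mvapich2 --disable-mcast --with-slurm=/home/localsoft/slurm
$ make && make install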