Hi all,

I am facing a rather strange error with the latest version of Slurm (16.05.0-0pre1): srun crashes when I execute it.

I have two experimental testbeds: one based on virtual machines and one physical. Both clusters, physical and virtual, run:

- OS: CentOS 7, updated
- MPICH version: 3.1.4
- slurm 16.05.0-0pre1
- munge-0.5.11
- slurm.conf configured with "MpiDefault=pmi2"

I have a test helloWorldMPI application. In the virtual cluster, the application can be executed with

--- ---
srun -n 2 --cpus-per-task=1 --ntasks-per-node=1 ./helloWorldMPI
sbatch -n 2 --cpus-per-task=1 --ntasks-per-node=1 helloWorldMPI.sh
(a script with a single line, "mpiexec helloWorldMPI")
--- ---

and both work OK. However, in the physical cluster, I can run the sbatch command, but the srun one crashes.

--- ---
-bash-4.2$ srun --version
slurm 16.05.0-0pre1
-bash-4.2$ srun -n 2 --cpus-per-task=1 --ntasks-per-node=1 ./helloWorldMPI
*** Error in `srun': free(): invalid pointer: 0x00007fc1ff774ed0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7d1fd)[0x7fc2000191fd]
srun(slurm_xfree+0x49)[0x442ce6]
srun(slurm_free_forward_data_msg+0x34)[0x4c0a34]
srun(slurm_free_msg_data+0xc70)[0x4c66b6]
srun(slurm_free_msg+0x53)[0x4864ae]
/home/localsoft/slurm/lib/slurm/mpi_pmi2.so(tree_msg_to_stepds+0x189)[0x7fc1ff56cbc3]
/home/localsoft/slurm/lib/slurm/mpi_pmi2.so(temp_kvs_send+0xd7)[0x7fc1ff563bfc]
/home/localsoft/slurm/lib/slurm/mpi_pmi2.so(+0xf18d)[0x7fc1ff56b18d]
/home/localsoft/slurm/lib/slurm/mpi_pmi2.so(handle_tree_cmd+0x49d)[0x7fc1ff56c601]
/home/localsoft/slurm/lib/slurm/mpi_pmi2.so(+0x556f)[0x7fc1ff56156f]
/home/localsoft/slurm/lib/slurm/mpi_pmi2.so(+0x5760)[0x7fc1ff561760]
srun[0x428b58]
srun[0x42891e]
srun(eio_handle_mainloop+0x1b0)[0x428528]
/home/localsoft/slurm/lib/slurm/mpi_pmi2.so(+0x5b41)[0x7fc1ff561b41]
/lib64/libpthread.so.0(+0x7df5)[0x7fc200364df5]
/lib64/libc.so.6(clone+0x6d)[0x7fc2000921ad]
======= Memory map: ========
00400000-005c2000 r-xp 00000000 00:22 26456 /home/localsoft/slurm/bin/srun
007c1000-007c2000 r--p 001c1000 00:22 26456 /home/localsoft/slurm/bin/srun
007c2000-007c9000 rw-p 001c2000 00:22 26456 /home/localsoft/slurm/bin/srun
007c9000-007cf000 rw-p 00000000 00:00 0
02628000-0288d000 rw-p 00000000 00:00 0 [heap]
7fc1e0000000-7fc1e0021000 rw-p 00000000 00:00 0
7fc1e0021000-7fc1e4000000 ---p 00000000 00:00 0
7fc1e8000000-7fc1e8021000 rw-p 00000000 00:00 0
7fc1e8021000-7fc1ec000000 ---p 00000000 00:00 0
7fc1ec000000-7fc1ec021000 rw-p 00000000 00:00 0
7fc1ec021000-7fc1f0000000 ---p 00000000 00:00 0
7fc1f0000000-7fc1f0021000 rw-p 00000000 00:00 0
7fc1f0021000-7fc1f4000000 ---p 00000000 00:00 0
7fc1f4000000-7fc1f4021000 rw-p 00000000 00:00 0
7fc1f4021000-7fc1f8000000 ---p 00000000 00:00 0
7fc1f8000000-7fc1f8021000 rw-p 00000000 00:00 0
7fc1f8021000-7fc1fc000000 ---p 00000000 00:00 0
7fc1fe030000-7fc1fe045000 r-xp 00000000 08:17 67109001 /usr/lib64/libgcc_s-4.8.3-20140911.so.1
7fc1fe045000-7fc1fe244000 ---p 00015000 08:17 67109001 /usr/lib64/libgcc_s-4.8.3-20140911.so.1
7fc1fe244000-7fc1fe245000 r--p 00014000 08:17 67109001 /usr/lib64/libgcc_s-4.8.3-20140911.so.1
7fc1fe245000-7fc1fe246000 rw-p 00015000 08:17 67109001 /usr/lib64/libgcc_s-4.8.3-20140911.so.1
7fc1fe246000-7fc1fe247000 ---p 00000000 00:00 0
7fc1fe247000-7fc1fe347000 rw-p 00000000 00:00 0
7fc1fe347000-7fc1fe348000 ---p 00000000 00:00 0
7fc1fe348000-7fc1fe448000 rw-p 00000000 00:00 0
7fc1fe448000-7fc1fe449000 r-xp 00000000 00:22 21727 /home/localsoft/slurm/lib/slurm/route_default.so
7fc1fe449000-7fc1fe648000 ---p 00001000 00:22 21727 /home/localsoft/slurm/lib/slurm/route_default.so
7fc1fe648000-7fc1fe649000 r--p 00000000 00:22 21727 /home/localsoft/slurm/lib/slurm/route_default.so
7fc1fe649000-7fc1fe64a000 rw-p 00001000 00:22 21727 /home/localsoft/slurm/lib/slurm/route_default.so
7fc1fe64a000-7fc1fe64b000 ---p 00000000 00:00 0
7fc1fe64b000-7fc1fe74b000 rw-p 00000000 00:00 0 [stack:16505]
7fc1fe74b000-7fc1fe74c000 ---p 00000000 00:00 0
7fc1fe74c000-7fc1fe84c000 rw-p 00000000 00:00 0 [stack:16504]
7fc1fe84c000-7fc1fe84d000 ---p 00000000 00:00 0
7fc1fe84d000-7fc1ff04d000 rw-p 00000000 00:00 0 [stack:16503]
7fc1ff04d000-7fc1ff04e000 ---p 00000000 00:00 0
7fc1ff04e000-7fc1ff14e000 rw-p 00000000 00:00 0 [stack:16502]
7fc1ff14e000-7fc1ff157000 r-xp 00000000 08:17 67390882 /usr/lib64/libmunge.so.2.0.0
7fc1ff157000-7fc1ff356000 ---p 00009000 08:17 67390882 /usr/lib64/libmunge.so.2.0.0
7fc1ff356000-7fc1ff357000 r--p 00008000 08:17 67390882 /usr/lib64/libmunge.so.2.0.0
7fc1ff357000-7fc1ff358000 rw-p 00009000 08:17 67390882 /usr/lib64/libmunge.so.2.0.0
7fc1ff358000-7fc1ff35b000 r-xp 00000000 00:22 1228 /home/localsoft/slurm/lib/slurm/auth_munge.so
7fc1ff35b000-7fc1ff55a000 ---p 00003000 00:22 1228 /home/localsoft/slurm/lib/slurm/auth_munge.so
7fc1ff55a000-7fc1ff55b000 r--p 00002000 00:22 1228 /home/localsoft/slurm/lib/slurm/auth_munge.so
7fc1ff55b000-7fc1ff55c000 rw-p 00003000 00:22 1228 /home/localsoft/slurm/lib/slurm/auth_munge.so
7fc1ff55c000-7fc1ff574000 r-xp 00000000 00:22 29464 /home/localsoft/slurm/lib/slurm/mpi_pmi2.so
7fc1ff574000-7fc1ff773000 ---p 00018000 00:22 29464 /home/localsoft/slurm/lib/slurm/mpi_pmi2.so
7fc1ff773000-7fc1ff774000 r--p 00017000 00:22 29464 /home/localsoft/slurm/lib/slurm/mpi_pmi2.so
7fc1ff774000-7fc1ff775000 rw-p 00018000 00:22 29464 /home/localsoft/slurm/lib/slurm/mpi_pmi2.so
7fc1ff775000-7fc1ff77a000 r-xp 00000000 00:22 1057 /home/localsoft/slurm/lib/slurm/launch_slurm.so
7fc1ff77a000-7fc1ff97a000 ---p 00005000 00:22 1057 /home/localsoft/slurm/lib/slurm/launch_slurm.so
7fc1ff97a000-7fc1ff97b000 r--p 00005000 00:22 1057 /home/localsoft/slurm/lib/slurm/launch_slurm.so
7fc1ff97b000-7fc1ff97c000 rw-p 00006000 00:22 1057 /home/localsoft/slurm/lib/slurm/launch_slurm.so
7fc1ff97c000-7fc1ff97e000 r-xp 00000000 00:22 23344 /home/localsoft/slurm/lib/slurm/switch_none.so
7fc1ff97e000-7fc1ffb7e000 ---p 00002000 00:22 23344 /home/localsoft/slurm/lib/slurm/switch_none.so
7fc1ffb7e000-7fc1ffb7f000 r--p 00002000 00:22 23344 /home/localsoft/slurm/lib/slurm/switch_none.so
7fc1ffb7f000-7fc1ffb80000 rw-p 00003000 00:22 23344 /home/localsoft/slurm/lib/slurm/switch_none.so
7fc1ffb80000-7fc1ffb8f000 r-xp 00000000 00:22 23335 /home/localsoft/slurm/lib/slurm/select_linear.so
7fc1ffb8f000-7fc1ffd8e000 ---p 0000f000 00:22 23335 /home/localsoft/slurm/lib/slurm/select_linear.so
7fc1ffd8e000-7fc1ffd8f000 r--p 0000e000 00:22 23335 /home/localsoft/slurm/lib/slurm/select_linear.so
7fc1ffd8f000-7fc1ffd90000 rw-p 0000f000 00:22 23335 /home/localsoft/slurm/lib/slurm/select_linear.so
7fc1ffd90000-7fc1ffd9b000 r-xp 00000000 08:17 67110724 /usr/lib64/libnss_files-2.17.so
7fc1ffd9b000-7fc1fff9a000 ---p 0000b000 08:17 67110724 /usr/lib64/libnss_files-2.17.so
7fc1fff9a000-7fc1fff9b000 r--p 0000a000 08:17 67110724 /usr/lib64/libnss_files-2.17.so
7fc1fff9b000-7fc1fff9c000 rw-p 0000b000 08:17 67110724 /usr/lib64/libnss_files-2.17.so
Abortado (`core' generado)
--- ---

It is important to note that if MPI is disabled the execution succeeds. Moreover, the behaviour is the same with any of the options listed by "--mpi=list".

srun --mpi=none -n 2 --cpus-per-task=1 --ntasks-per-node=1 ./helloWorldMPI

Of course this is not a solution, as it creates 3 jobs with 1 thread each instead of 1 job with 3 threads, but maybe it helps in finding the problem.

I assumed that it was a problem with my configuration. However, I tried downgrading Slurm to the previous release version (slurm 15.08.4), and now it works fine.

Summing up:

- in the virtual environment, slurm 16.05.0-0pre1: srun and sbatch work
- in the real environment, slurm 15.08: srun and sbatch work
- in the real environment, slurm 16.05.0-0pre1: sbatch works, srun DOES NOT work

Any hints?

Thanks for your help,

Manuel

--
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108
CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040- MADRID
SPAIN