Hi all,

I am facing a rather weird error with the latest version of Slurm
(16.05.0-0pre1): srun crashes when executed.

I have two experimental testbeds: one based on virtual machines and
one physical.

Both clusters, physical and virtual, run:
- OS: CentOS 7, updated
- MPICH version: 3.1.4
- slurm 16.05.0-0pre1
- munge-0.5.11
- slurm.conf configured with "MpiDefault=pmi2"

I have a test helloWorldMPI application.

In the virtual cluster, the application can be executed with either of
---
srun  -n 2 --cpus-per-task=1 --ntasks-per-node=1 ./helloWorldMPI
sbatch  -n 2 --cpus-per-task=1 --ntasks-per-node=1 helloWorldMPI.sh
---
(helloWorldMPI.sh is a script with a single line, "mpiexec helloWorldMPI").


Both work OK.

However, in the physical cluster, I can run the sbatch command, but
the srun one crashes.

---
-bash-4.2$ srun --version
slurm 16.05.0-0pre1

-bash-4.2$ srun  -n 2 --cpus-per-task=1 --ntasks-per-node=1 ./helloWorldMPI
*** Error in `srun': free(): invalid pointer: 0x00007fc1ff774ed0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7d1fd)[0x7fc2000191fd]
srun(slurm_xfree+0x49)[0x442ce6]
srun(slurm_free_forward_data_msg+0x34)[0x4c0a34]
srun(slurm_free_msg_data+0xc70)[0x4c66b6]
srun(slurm_free_msg+0x53)[0x4864ae]
/home/localsoft/slurm/lib/slurm/mpi_pmi2.so(tree_msg_to_stepds+0x189)[0x7fc1ff56cbc3]
/home/localsoft/slurm/lib/slurm/mpi_pmi2.so(temp_kvs_send+0xd7)[0x7fc1ff563bfc]
/home/localsoft/slurm/lib/slurm/mpi_pmi2.so(+0xf18d)[0x7fc1ff56b18d]
/home/localsoft/slurm/lib/slurm/mpi_pmi2.so(handle_tree_cmd+0x49d)[0x7fc1ff56c601]
/home/localsoft/slurm/lib/slurm/mpi_pmi2.so(+0x556f)[0x7fc1ff56156f]
/home/localsoft/slurm/lib/slurm/mpi_pmi2.so(+0x5760)[0x7fc1ff561760]
srun[0x428b58]
srun[0x42891e]
srun(eio_handle_mainloop+0x1b0)[0x428528]
/home/localsoft/slurm/lib/slurm/mpi_pmi2.so(+0x5b41)[0x7fc1ff561b41]
/lib64/libpthread.so.0(+0x7df5)[0x7fc200364df5]
/lib64/libc.so.6(clone+0x6d)[0x7fc2000921ad]
======= Memory map: ========
00400000-005c2000 r-xp 00000000 00:22 26456
  /home/localsoft/slurm/bin/srun
007c1000-007c2000 r--p 001c1000 00:22 26456
  /home/localsoft/slurm/bin/srun
007c2000-007c9000 rw-p 001c2000 00:22 26456
  /home/localsoft/slurm/bin/srun
007c9000-007cf000 rw-p 00000000 00:00 0
02628000-0288d000 rw-p 00000000 00:00 0                                  [heap]
7fc1e0000000-7fc1e0021000 rw-p 00000000 00:00 0
7fc1e0021000-7fc1e4000000 ---p 00000000 00:00 0
7fc1e8000000-7fc1e8021000 rw-p 00000000 00:00 0
7fc1e8021000-7fc1ec000000 ---p 00000000 00:00 0
7fc1ec000000-7fc1ec021000 rw-p 00000000 00:00 0
7fc1ec021000-7fc1f0000000 ---p 00000000 00:00 0
7fc1f0000000-7fc1f0021000 rw-p 00000000 00:00 0
7fc1f0021000-7fc1f4000000 ---p 00000000 00:00 0
7fc1f4000000-7fc1f4021000 rw-p 00000000 00:00 0
7fc1f4021000-7fc1f8000000 ---p 00000000 00:00 0
7fc1f8000000-7fc1f8021000 rw-p 00000000 00:00 0
7fc1f8021000-7fc1fc000000 ---p 00000000 00:00 0
7fc1fe030000-7fc1fe045000 r-xp 00000000 08:17 67109001
  /usr/lib64/libgcc_s-4.8.3-20140911.so.1
7fc1fe045000-7fc1fe244000 ---p 00015000 08:17 67109001
  /usr/lib64/libgcc_s-4.8.3-20140911.so.1
7fc1fe244000-7fc1fe245000 r--p 00014000 08:17 67109001
  /usr/lib64/libgcc_s-4.8.3-20140911.so.1
7fc1fe245000-7fc1fe246000 rw-p 00015000 08:17 67109001
  /usr/lib64/libgcc_s-4.8.3-20140911.so.1
7fc1fe246000-7fc1fe247000 ---p 00000000 00:00 0
7fc1fe247000-7fc1fe347000 rw-p 00000000 00:00 0
7fc1fe347000-7fc1fe348000 ---p 00000000 00:00 0
7fc1fe348000-7fc1fe448000 rw-p 00000000 00:00 0
7fc1fe448000-7fc1fe449000 r-xp 00000000 00:22 21727
  /home/localsoft/slurm/lib/slurm/route_default.so
7fc1fe449000-7fc1fe648000 ---p 00001000 00:22 21727
  /home/localsoft/slurm/lib/slurm/route_default.so
7fc1fe648000-7fc1fe649000 r--p 00000000 00:22 21727
  /home/localsoft/slurm/lib/slurm/route_default.so
7fc1fe649000-7fc1fe64a000 rw-p 00001000 00:22 21727
  /home/localsoft/slurm/lib/slurm/route_default.so
7fc1fe64a000-7fc1fe64b000 ---p 00000000 00:00 0
7fc1fe64b000-7fc1fe74b000 rw-p 00000000 00:00 0
  [stack:16505]
7fc1fe74b000-7fc1fe74c000 ---p 00000000 00:00 0
7fc1fe74c000-7fc1fe84c000 rw-p 00000000 00:00 0
  [stack:16504]
7fc1fe84c000-7fc1fe84d000 ---p 00000000 00:00 0
7fc1fe84d000-7fc1ff04d000 rw-p 00000000 00:00 0
  [stack:16503]
7fc1ff04d000-7fc1ff04e000 ---p 00000000 00:00 0
7fc1ff04e000-7fc1ff14e000 rw-p 00000000 00:00 0
  [stack:16502]
7fc1ff14e000-7fc1ff157000 r-xp 00000000 08:17 67390882
  /usr/lib64/libmunge.so.2.0.0
7fc1ff157000-7fc1ff356000 ---p 00009000 08:17 67390882
  /usr/lib64/libmunge.so.2.0.0
7fc1ff356000-7fc1ff357000 r--p 00008000 08:17 67390882
  /usr/lib64/libmunge.so.2.0.0
7fc1ff357000-7fc1ff358000 rw-p 00009000 08:17 67390882
  /usr/lib64/libmunge.so.2.0.0
7fc1ff358000-7fc1ff35b000 r-xp 00000000 00:22 1228
  /home/localsoft/slurm/lib/slurm/auth_munge.so
7fc1ff35b000-7fc1ff55a000 ---p 00003000 00:22 1228
  /home/localsoft/slurm/lib/slurm/auth_munge.so
7fc1ff55a000-7fc1ff55b000 r--p 00002000 00:22 1228
  /home/localsoft/slurm/lib/slurm/auth_munge.so
7fc1ff55b000-7fc1ff55c000 rw-p 00003000 00:22 1228
  /home/localsoft/slurm/lib/slurm/auth_munge.so
7fc1ff55c000-7fc1ff574000 r-xp 00000000 00:22 29464
  /home/localsoft/slurm/lib/slurm/mpi_pmi2.so
7fc1ff574000-7fc1ff773000 ---p 00018000 00:22 29464
  /home/localsoft/slurm/lib/slurm/mpi_pmi2.so
7fc1ff773000-7fc1ff774000 r--p 00017000 00:22 29464
  /home/localsoft/slurm/lib/slurm/mpi_pmi2.so
7fc1ff774000-7fc1ff775000 rw-p 00018000 00:22 29464
  /home/localsoft/slurm/lib/slurm/mpi_pmi2.so
7fc1ff775000-7fc1ff77a000 r-xp 00000000 00:22 1057
  /home/localsoft/slurm/lib/slurm/launch_slurm.so
7fc1ff77a000-7fc1ff97a000 ---p 00005000 00:22 1057
  /home/localsoft/slurm/lib/slurm/launch_slurm.so
7fc1ff97a000-7fc1ff97b000 r--p 00005000 00:22 1057
  /home/localsoft/slurm/lib/slurm/launch_slurm.so
7fc1ff97b000-7fc1ff97c000 rw-p 00006000 00:22 1057
  /home/localsoft/slurm/lib/slurm/launch_slurm.so
7fc1ff97c000-7fc1ff97e000 r-xp 00000000 00:22 23344
  /home/localsoft/slurm/lib/slurm/switch_none.so
7fc1ff97e000-7fc1ffb7e000 ---p 00002000 00:22 23344
  /home/localsoft/slurm/lib/slurm/switch_none.so
7fc1ffb7e000-7fc1ffb7f000 r--p 00002000 00:22 23344
  /home/localsoft/slurm/lib/slurm/switch_none.so
7fc1ffb7f000-7fc1ffb80000 rw-p 00003000 00:22 23344
  /home/localsoft/slurm/lib/slurm/switch_none.so
7fc1ffb80000-7fc1ffb8f000 r-xp 00000000 00:22 23335
  /home/localsoft/slurm/lib/slurm/select_linear.so
7fc1ffb8f000-7fc1ffd8e000 ---p 0000f000 00:22 23335
  /home/localsoft/slurm/lib/slurm/select_linear.so
7fc1ffd8e000-7fc1ffd8f000 r--p 0000e000 00:22 23335
  /home/localsoft/slurm/lib/slurm/select_linear.so
7fc1ffd8f000-7fc1ffd90000 rw-p 0000f000 00:22 23335
  /home/localsoft/slurm/lib/slurm/select_linear.so
7fc1ffd90000-7fc1ffd9b000 r-xp 00000000 08:17 67110724
  /usr/lib64/libnss_files-2.17.so
7fc1ffd9b000-7fc1fff9a000 ---p 0000b000 08:17 67110724
  /usr/lib64/libnss_files-2.17.so
7fc1fff9a000-7fc1fff9b000 r--p 0000a000 08:17 67110724
  /usr/lib64/libnss_files-2.17.so
7fc1fff9b000-7fc1fff9c000 rw-p 0000b000 08:17 67110724
  /usr/lib64/libnss_files-2.17.so
Aborted (core dumped)

---
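
When reading a glibc backtrace like the one above, it can help to split each frame into its module, symbol, offset and address, so that the frames belonging to one plugin (here mpi_pmi2.so) can be listed on their own. The following is only a minimal sketch, assuming Python 3 is available. The names `FRAME_RE`, `Frame`, `parse_frame` and `frames_in` are hypothetical helpers made up for this illustration and are not part of Slurm. The sample lines are copied from the backtrace in this message.

```python
import re
from typing import List, NamedTuple, Optional

# One glibc backtrace frame looks like: module(symbol+0xOFFSET)[0xADDRESS]
# The symbol, or the whole parenthesised part, may be missing.
FRAME_RE = re.compile(
    r"^(?P<module>[^(\[]+)"
    r"(?:\((?P<symbol>[^+)]*)(?:\+(?P<offset>0x[0-9a-fA-F]+))?\))?"
    r"\[(?P<addr>0x[0-9a-fA-F]+)\]$"
)


class Frame(NamedTuple):
    module: str
    symbol: Optional[str]
    offset: Optional[int]
    address: int


def parse_frame(line: str) -> Optional[Frame]:
    """Parse one backtrace line; return None if it is not a frame."""
    m = FRAME_RE.match(line.strip())
    if m is None:
        return None
    offset = m.group("offset")
    return Frame(
        module=m.group("module"),
        symbol=m.group("symbol") or None,
        offset=int(offset, 16) if offset else None,
        address=int(m.group("addr"), 16),
    )


def frames_in(text: str, module_suffix: str) -> List[Frame]:
    """Return the frames whose module path ends with module_suffix."""
    frames = (parse_frame(line) for line in text.splitlines())
    return [f for f in frames if f and f.module.endswith(module_suffix)]


# A few frames copied from the backtrace in this message.
SAMPLE = """
srun(slurm_free_msg+0x53)[0x4864ae]
/home/localsoft/slurm/lib/slurm/mpi_pmi2.so(tree_msg_to_stepds+0x189)[0x7fc1ff56cbc3]
/home/localsoft/slurm/lib/slurm/mpi_pmi2.so(temp_kvs_send+0xd7)[0x7fc1ff563bfc]
/home/localsoft/slurm/lib/slurm/mpi_pmi2.so(+0xf18d)[0x7fc1ff56b18d]
srun[0x428b58]
"""

# Print the named symbols that live in the pmi2 plugin.
print([f.symbol for f in frames_in(SAMPLE, "mpi_pmi2.so") if f.symbol])
```

Frames without a symbol name (for example "srun[0x428b58]") are kept with `symbol` set to None, so their raw addresses remain available for a debugger or a symbolizer.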

It is important to note that if MPI is disabled the execution succeeds.
Moreover, the behaviour is the same when using any of the options listed
by "--mpi=list".

srun --mpi=none  -n 2 --cpus-per-task=1 --ntasks-per-node=1 ./helloWorldMPI

Of course this is not a solution, as it creates 3 jobs with 1 thread
instead of 1 job with 3 threads, but maybe it helps in finding the problem.
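
To record which MPI types crash and which do not, the same srun command can be repeated once per type and the exit status collected. What follows is only a rough sketch, assuming Python 3 on the login node. The helper names `srun_command` and `try_mpi_types` are hypothetical. The two type names used in the example ("none" and "pmi2") are the ones mentioned in this message; any others should be taken from the output of "srun --mpi=list" on the actual installation.

```python
import subprocess

# Arguments shared by every attempt, copied from the command in this message.
BASE_ARGS = ["-n", "2", "--cpus-per-task=1", "--ntasks-per-node=1", "./helloWorldMPI"]


def srun_command(mpi_type):
    """Build the srun argument list for one MPI type."""
    return ["srun", "--mpi=" + mpi_type, *BASE_ARGS]


def try_mpi_types(mpi_types, run=subprocess.run):
    """Run srun once per MPI type and map each type to its return code.

    A negative return code means the process was killed by a signal
    (for example -6 for SIGABRT). The `run` parameter exists only so
    the function can be exercised without a Slurm installation.
    """
    results = {}
    for mpi_type in mpi_types:
        proc = run(srun_command(mpi_type), capture_output=True, text=True)
        results[mpi_type] = proc.returncode
    return results


if __name__ == "__main__":
    # Type names taken from this message; extend with "srun --mpi=list".
    for mpi_type, code in try_mpi_types(["none", "pmi2"]).items():
        print(mpi_type, code)
```

Running it on both the virtual and the physical cluster would give two small tables that can be compared side by side.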


I assumed that it was a problem with my configuration. However, I tried
downgrading Slurm to the previous release (slurm 15.08.4), and with that
version it works fine.

Summing up:

- in the virtual environment, slurm 16.05.0-0pre1: srun and sbatch work
- in the real environment, slurm 15.08.4: srun and sbatch work
- in the real environment, slurm 16.05.0-0pre1: sbatch works, srun DOES NOT work

Any hints?


Thanks for your help,


Manuel



-- 
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108

CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040- MADRID
SPAIN
