
I have installed openmpi-v2.0.0-233-gb5f0a4f on my "SUSE Linux
Enterprise Server 12 (x86_64)" with Sun C 5.14 beta and gcc-6.1.0.
Unfortunately I have a problem with my program "spawn_master".
It hangs if I run it on my local machine, and I get a segmentation
fault if I run it on a remote machine. Both machines use the same
operating system. Everything works as expected if I list the same
hostname five times in "--host" instead of combining "--host" with
"--slot-list". Everything also works as expected if I use my
program "spawn_multiple_master" instead of "spawn_master".
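
For reference, as far as I understand the option, each entry of
"--slot-list" has the form socket:cores, so "0:0-5,1:0-5" should offer
cores 0-5 on socket 0 and cores 0-5 on socket 1. That would be more
than enough room for one parent and four slave processes. The
following small Python sketch only illustrates how I read the list.
The helper name "expand_slot_list" is made up for this illustration
and is not part of Open MPI.

```python
def expand_slot_list(slot_list):
    """Expand 'socket:cores' entries into (socket, core) pairs.

    Assumption: every entry looks like 'S:C' or 'S:C1-C2', which is
    how the slot list in the commands below is written.
    """
    pairs = []
    for entry in slot_list.split(","):
        socket, cores = entry.split(":")
        if "-" in cores:
            first, last = cores.split("-")
        else:
            first = last = cores
        # Core ranges are inclusive on both ends.
        for core in range(int(first), int(last) + 1):
            pairs.append((int(socket), core))
    return pairs


slots = expand_slot_list("0:0-5,1:0-5")
print(len(slots))  # number of cores named by the slot list
```

With this reading the slot list names 12 cores, while the job needs
only 5 processes (1 parent and 4 slaves).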

loki hello_2 151 ompi_info | grep -e "Open MPI repo revision" -e "C compiler absolute"
  Open MPI repo revision: v2.0.0-233-gb5f0a4f
     C compiler absolute: /opt/solstudio12.5b/bin/cc

loki spawn 152 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master

Parent process 0 running on loki
  I create 4 slave processes

loki spawn 153 mpiexec -np 1 --host nfs1 --slot-list 0:0-5,1:0-5 spawn_master

Parent process 0 running on nfs1
  I create 4 slave processes

[nfs1:09963] *** Process received signal ***
[nfs1:09963] Signal: Segmentation fault (11)
[nfs1:09963] Signal code: Address not mapped (1)
[nfs1:09963] Failing at address: 0x64
[nfs1:09963] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f6f55794870]
[nfs1:09963] [ 1] /usr/local/openmpi-2.0.1_64_cc/lib64/openmpi/mca_state_orted.so(+0x1055a)[0x7f6f5478155a]
[nfs1:09963] [ 2] /usr/local/openmpi-2.0.1_64_cc/lib64/libopen-pal.so.20(+0x2306a4)[0x7f6f566f46a4]
[nfs1:09963] [ 3] /usr/local/openmpi-2.0.1_64_cc/lib64/libopen-pal.so.20(+0x230a2a)[0x7f6f566f4a2a]
[nfs1:09963] [ 4] /usr/local/openmpi-2.0.1_64_cc/lib64/libopen-pal.so.20(opal_libevent2022_event_base_loop+0x2d9)[0x7f6f566f5379]
[nfs1:09963] [ 5] /usr/local/openmpi-2.0.1_64_cc/lib64/libopen-rte.so.20(orte_daemon+0x2b66)[0x7f6f56cf63c6]
[nfs1:09963] [ 6] orted[0x407575]
[nfs1:09963] [ 7] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6f553feb25]
[nfs1:09963] [ 8] orted[0x401832]
[nfs1:09963] *** End of error message ***
Segmentation fault
ORTE has lost communication with its daemon located on node:

  hostname:  nfs1

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
loki spawn 154

loki spawn 144 mpiexec -np 1 --host loki,loki,loki,loki,loki spawn_master

Parent process 0 running on loki
  I create 4 slave processes

Slave process 0 of 4 running on loki
spawn_slave 0: argv[0]: spawn_slave
Slave process 1 of 4 running on loki
spawn_slave 1: argv[0]: spawn_slave
Slave process 2 of 4 running on loki
spawn_slave 2: argv[0]: spawn_slave
Slave process 3 of 4 running on loki
spawn_slave 3: argv[0]: spawn_slave
Parent process 0: tasks in MPI_COMM_WORLD:                    1
                  tasks in COMM_CHILD_PROCESSES local group:  1
                  tasks in COMM_CHILD_PROCESSES remote group: 4

loki spawn 145

loki spawn 106 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_multiple_master

Parent process 0 running on loki
  I create 3 slave processes.

Slave process 0 of 2 running on loki
Slave process 1 of 2 running on loki
spawn_slave 1: argv[0]: spawn_slave
spawn_slave 1: argv[1]: program type 2
spawn_slave 1: argv[2]: another parameter
spawn_slave 0: argv[0]: spawn_slave
spawn_slave 0: argv[1]: program type 1
Parent process 0: tasks in MPI_COMM_WORLD:                    1
                  tasks in COMM_CHILD_PROCESSES local group:  1
                  tasks in COMM_CHILD_PROCESSES remote group: 2

loki spawn 107 mpiexec -np 1 --host nfs1 --slot-list 0:0-5,1:0-5 spawn_multiple_master

Parent process 0 running on nfs1
  I create 3 slave processes.

Slave process 0 of 2 running on nfs1
spawn_slave 0: argv[0]: spawn_slave
spawn_slave 0: argv[1]: program type 1
Slave process 1 of 2 running on nfs1
spawn_slave 1: argv[0]: spawn_slave
spawn_slave 1: argv[1]: program type 2
spawn_slave 1: argv[2]: another parameter
Parent process 0: tasks in MPI_COMM_WORLD:                    1
                  tasks in COMM_CHILD_PROCESSES local group:  1
                  tasks in COMM_CHILD_PROCESSES remote group: 2

loki spawn 108

I would be grateful if somebody could fix the problem. Thank you
very much in advance for any help.

Kind regards
