Hi,

I have installed openmpi-v2.0.0-233-gb5f0a4f on my "SUSE Linux
Enterprise Server 12 (x86_64)" with Sun C 5.14 beta and gcc-6.1.0.
Unfortunately I have a problem with my program "spawn_master":
it hangs if I run it on my local machine, and I get a segmentation
fault if I run it on a remote machine. Both machines use the same
operating system. Everything works as expected if I use the same
hostname five times in "--host" instead of the combination of
"--host" and "--slot-list". Everything also works as expected if I
use my program "spawn_multiple_master" instead of "spawn_master".


loki hello_2 151 ompi_info | grep -e "Open MPI repo revision" -e "C compiler absolute"
  Open MPI repo revision: v2.0.0-233-gb5f0a4f
     C compiler absolute: /opt/solstudio12.5b/bin/cc


loki spawn 152 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master

Parent process 0 running on loki
  I create 4 slave processes

^C
loki spawn 153 mpiexec -np 1 --host nfs1 --slot-list 0:0-5,1:0-5 spawn_master

Parent process 0 running on nfs1
  I create 4 slave processes

[nfs1:09963] *** Process received signal ***
[nfs1:09963] Signal: Segmentation fault (11)
[nfs1:09963] Signal code: Address not mapped (1)
[nfs1:09963] Failing at address: 0x64
[nfs1:09963] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f6f55794870]
[nfs1:09963] [ 1] /usr/local/openmpi-2.0.1_64_cc/lib64/openmpi/mca_state_orted.so(+0x1055a)[0x7f6f5478155a]
[nfs1:09963] [ 2] /usr/local/openmpi-2.0.1_64_cc/lib64/libopen-pal.so.20(+0x2306a4)[0x7f6f566f46a4]
[nfs1:09963] [ 3] /usr/local/openmpi-2.0.1_64_cc/lib64/libopen-pal.so.20(+0x230a2a)[0x7f6f566f4a2a]
[nfs1:09963] [ 4] /usr/local/openmpi-2.0.1_64_cc/lib64/libopen-pal.so.20(opal_libevent2022_event_base_loop+0x2d9)[0x7f6f566f5379]
[nfs1:09963] [ 5] /usr/local/openmpi-2.0.1_64_cc/lib64/libopen-rte.so.20(orte_daemon+0x2b66)[0x7f6f56cf63c6]
[nfs1:09963] [ 6] orted[0x407575]
[nfs1:09963] [ 7] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6f553feb25]
[nfs1:09963] [ 8] orted[0x401832]
[nfs1:09963] *** End of error message ***
Segmentation fault
--------------------------------------------------------------------------
ORTE has lost communication with its daemon located on node:

  hostname:  nfs1

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
loki spawn 154




loki spawn 144 mpiexec -np 1 --host loki,loki,loki,loki,loki spawn_master

Parent process 0 running on loki
  I create 4 slave processes

Slave process 0 of 4 running on loki
spawn_slave 0: argv[0]: spawn_slave
Slave process 1 of 4 running on loki
spawn_slave 1: argv[0]: spawn_slave
Slave process 2 of 4 running on loki
spawn_slave 2: argv[0]: spawn_slave
Slave process 3 of 4 running on loki
spawn_slave 3: argv[0]: spawn_slave
Parent process 0: tasks in MPI_COMM_WORLD:                    1
                  tasks in COMM_CHILD_PROCESSES local group:  1
                  tasks in COMM_CHILD_PROCESSES remote group: 4

loki spawn 145



loki spawn 106 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_multiple_master

Parent process 0 running on loki
  I create 3 slave processes.

Slave process 0 of 2 running on loki
Slave process 1 of 2 running on loki
spawn_slave 1: argv[0]: spawn_slave
spawn_slave 1: argv[1]: program type 2
spawn_slave 1: argv[2]: another parameter
spawn_slave 0: argv[0]: spawn_slave
spawn_slave 0: argv[1]: program type 1
Parent process 0: tasks in MPI_COMM_WORLD:                    1
                  tasks in COMM_CHILD_PROCESSES local group:  1
                  tasks in COMM_CHILD_PROCESSES remote group: 2


loki spawn 107 mpiexec -np 1 --host nfs1 --slot-list 0:0-5,1:0-5 spawn_multiple_master

Parent process 0 running on nfs1
  I create 3 slave processes.

Slave process 0 of 2 running on nfs1
spawn_slave 0: argv[0]: spawn_slave
spawn_slave 0: argv[1]: program type 1
Slave process 1 of 2 running on nfs1
spawn_slave 1: argv[0]: spawn_slave
spawn_slave 1: argv[1]: program type 2
spawn_slave 1: argv[2]: another parameter
Parent process 0: tasks in MPI_COMM_WORLD:                    1
                  tasks in COMM_CHILD_PROCESSES local group:  1
                  tasks in COMM_CHILD_PROCESSES remote group: 2

loki spawn 108
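
For comparison, "spawn_multiple_master" does the same kind of thing
with MPI_Comm_spawn_multiple. Again a simplified sketch (the two
command lines are taken from the slave output above; error handling
and the remaining output are omitted, so the real program differs in
details):

/* spawn_multiple_master.c - simplified sketch, not the exact source */
#include <stdio.h>
#include <mpi.h>

#define NUM_COMMANDS 2

int main (int argc, char *argv[])
{
  /* two instances of "spawn_slave", started with different arguments */
  char *commands[NUM_COMMANDS]    = { "spawn_slave", "spawn_slave" };
  char *args_1[]                  = { "program type 1", NULL };
  char *args_2[]                  = { "program type 2",
                                      "another parameter", NULL };
  char **spawn_argv[NUM_COMMANDS] = { args_1, args_2 };
  int      maxprocs[NUM_COMMANDS] = { 1, 1 };
  MPI_Info infos[NUM_COMMANDS]    = { MPI_INFO_NULL, MPI_INFO_NULL };
  int      rank, name_len;
  char     name[MPI_MAX_PROCESSOR_NAME];
  MPI_Comm COMM_CHILD_PROCESSES;

  MPI_Init (&argc, &argv);
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);
  MPI_Get_processor_name (name, &name_len);
  printf ("Parent process %d running on %s\n", rank, name);

  /* this call works on both machines, even with "--slot-list" */
  MPI_Comm_spawn_multiple (NUM_COMMANDS, commands, spawn_argv, maxprocs,
                           infos, 0, MPI_COMM_WORLD,
                           &COMM_CHILD_PROCESSES, MPI_ERRCODES_IGNORE);

  MPI_Finalize ();
  return 0;
}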



I would be grateful if somebody could fix the problem. Thank you
very much in advance for any help.


Kind regards

Siegmar
