BTW: I've confirmed this only happens if you provide the hostfile info key. A simple comm_spawn without the hostfile key works just fine.
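To make the failing case concrete, here's a minimal sketch of the pattern being discussed: a singleton parent (launched without mpirun) passing a hostfile to MPI_Comm_spawn through the "hostfile" info key, which is what triggers the orted segfault in 1.8.4. This is a reconstruction, not the attached master.c; the child binary name and hostfile path are placeholders from the thread.

```c
/* Sketch of the failing pattern: singleton parent + "hostfile" info key.
 * Build with mpicc and run the binary directly (no mpirun) to exercise
 * the singleton MPI_INIT path. Names below are placeholders. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);          /* singleton init: no mpirun involved */

    MPI_Info info;
    MPI_Info_create(&info);
    /* Open MPI accepts a "hostfile" info key on MPI_Comm_spawn; with a
     * non-local host in that file, orted segfaults in 1.8.4 as shown below. */
    MPI_Info_set(info, "hostfile", "test_distributed.hostfile");

    MPI_Comm children;
    int errcodes[8];
    MPI_Comm_spawn("./child", MPI_ARGV_NULL, 8, info, 0,
                   MPI_COMM_SELF, &children, errcodes);

    MPI_Info_free(&info);
    MPI_Comm_disconnect(&children);
    MPI_Finalize();
    return 0;
}
```

(No expected output given: without the fix this aborts in the forked HNP, per the trace below in the thread.)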
On Sun, Feb 1, 2015 at 8:53 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Well, I can reproduce it - but I won't have time to address it until I
> return later this week.
>
> Whether or not procs get spawned onto a remote host depends on the number
> of local slots. You asked for 8 processes, so if there are more than 8
> slots on the node, then it will launch them all on the local node. If you
> want to spread them across nodes, you need to use --map-by node.
>
> Also, FWIW: this job will "hang" because the spawned procs ("hostname")
> never call MPI_Init. You can only use MPI_Comm_spawn to launch MPI
> processes, as the spawning parent will blissfully wait forever for the
> child processes to call MPI_Init.
>
> > On Jan 26, 2015, at 11:29 AM, Evan <evan.sama...@gmail.com> wrote:
> >
> > Hi,
> >
> > I am using Open MPI 1.8.4 on an Ubuntu 14.04 machine and 5 Ubuntu 12.04
> > machines. I am using ssh to launch MPI jobs, and I'm able to run simple
> > programs like 'mpirun -np 8 --host localhost,pachy1 hostname' and get
> > the expected output (pachy1 being an entry in my /etc/hosts file).
> >
> > I started using MPI_Comm_spawn in my app with the intent of NOT calling
> > mpirun to launch the program that calls MPI_Comm_spawn (my attempt at
> > using the singleton MPI_INIT pattern described in section 10.5.2 of the
> > MPI 3.0 standard). The app needs to launch an MPI job of a given size
> > from a given hostfile, and the job needs to report some info back to
> > the app, so it seemed MPI_Comm_spawn was my best bet. The app is only
> > rarely going to be used this way, hence mpirun not being used to launch
> > the app that is the parent in the MPI_Comm_spawn operation. This
> > pattern works fine if the only entries in the hostfile are 'localhost'.
> > However, if I add a host that isn't local, I get a segmentation fault
> > from the orted process.
> >
> > In any case, I distilled my example down as small as I could. I've
> > attached the C code of the master and the hostfile I'm using.
> > Here's the output:
> >
> > evan@lasarti:~/devel/toy_progs/mpi_spawn$ ./master ~/mpi/test_distributed.hostfile
> > [lasarti:32020] [[21014,1],0] FORKING HNP: orted --hnp --set-sid --report-uri 14 --singleton-died-pipe 15 -mca state_novm_select 1 -mca ess_base_jobid 1377173504
> > [lasarti:32022] *** Process received signal ***
> > [lasarti:32022] Signal: Segmentation fault (11)
> > [lasarti:32022] Signal code: Address not mapped (1)
> > [lasarti:32022] Failing at address: (nil)
> > [lasarti:32022] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x7f07af039340]
> > [lasarti:32022] [ 1] /opt/openmpi-1.8.4/lib/libopen-pal.so.6(opal_hwloc191_hwloc_get_obj_by_depth+0x32)[0x7f07aea227c2]
> > [lasarti:32022] [ 2] /opt/openmpi-1.8.4/lib/libopen-pal.so.6(opal_hwloc_base_get_nbobjs_by_type+0x90)[0x7f07ae9f5430]
> > [lasarti:32022] [ 3] /opt/openmpi-1.8.4/lib/openmpi/mca_rmaps_round_robin.so(orte_rmaps_rr_byobj+0x134)[0x7f07ab2fb154]
> > [lasarti:32022] [ 4] /opt/openmpi-1.8.4/lib/openmpi/mca_rmaps_round_robin.so(+0x12c6)[0x7f07ab2fa2c6]
> > [lasarti:32022] [ 5] /opt/openmpi-1.8.4/lib/libopen-rte.so.7(orte_rmaps_base_map_job+0x21a)[0x7f07af299f7a]
> > [lasarti:32022] [ 6] /opt/openmpi-1.8.4/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0x6e4)[0x7f07ae9e7034]
> > [lasarti:32022] [ 7] /opt/openmpi-1.8.4/lib/libopen-rte.so.7(orte_daemon+0xdff)[0x7f07af27a86f]
> > [lasarti:32022] [ 8] orted(main+0x47)[0x400877]
> > [lasarti:32022] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f07aec84ec5]
> > [lasarti:32022] [10] orted[0x4008cb]
> > [lasarti:32022] *** End of error message ***
> >
> > If I launch 'master.c' using mpirun, I don't get a segmentation fault,
> > but it doesn't seem to launch the processes on anything more than
> > localhost, no matter what hostfile I give it.
> > For what it's worth, I fully expected to debug some path issues
> > regarding the binary I wanted to launch with MPI_Comm_spawn once I ran
> > this in a distributed setting, but this error at first glance doesn't
> > appear to have anything to do with that. I'm sure this is something
> > silly I'm doing wrong, but I don't really know how to debug this
> > further given this error.
> >
> > Evan
> >
> > P.S. Only including the zipped config.log, since the "ompi_info -v ompi
> > full --parsable" command I got from http://www.open-mpi.org/community/help/
> > doesn't seem to work anymore.
> >
> > <master.c><test_distributed.hostfile><config.log.tar.bz2>
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post: http://www.open-mpi.org/community/lists/users/2015/01/26235.php
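A footnote for anyone reproducing the "hang" Ralph mentions above: because MPI_Comm_spawn blocks until the children initialize MPI, the spawned command must itself be an MPI program, not plain `hostname`. A minimal MPI-aware stand-in (hypothetical file name child.c, my sketch rather than anything attached to the thread) might look like:

```c
/* Minimal child that satisfies MPI_Comm_spawn's handshake: unlike plain
 * `hostname`, it calls MPI_Init so the spawning parent can proceed. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);   /* connects back to the spawning parent */

    char name[256];
    gethostname(name, sizeof(name));

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("child rank %d on %s\n", rank, name);

    /* The parent side of the intercommunicator is available via
     * MPI_Comm_get_parent() if the job needs to report info back. */
    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);
    if (parent != MPI_COMM_NULL)
        MPI_Comm_disconnect(&parent);

    MPI_Finalize();
    return 0;
}
```

Compile with mpicc and point MPI_Comm_spawn's command argument at the resulting binary; the spawn call then returns instead of waiting forever.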