BTW: I've confirmed this only happens if you provide the hostfile info key.
A simple comm_spawn without the hostfile key works just fine.
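
For reference, "provide the hostfile info key" means doing something like
this before the spawn (a fragment, not a complete program; the path, the
"./worker" binary name, and the intercomm variable are placeholders):

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "hostfile", "/path/to/my.hostfile"); /* this key triggers it */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 8, info, 0,
                   MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);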


On Sun, Feb 1, 2015 at 8:53 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Well, I can reproduce it - but I won’t have time to address it until I
> return later this week.
>
> Whether or not procs get spawned onto a remote host depends on the number
> of local slots. You asked for 8 processes, so if there are more than 8
> slots on the node, then it will launch them all on the local node. If you
> want to spread them across nodes, you need to use --map-by node.
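>
> For example (a sketch; ./a.out stands in for your binary):
>
>     mpirun -np 8 --map-by node --host localhost,pachy1 ./a.out
>
> That maps ranks round-robin across nodes instead of filling the local
> slots first.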
>
> Also, FWIW: this job will “hang”, as the spawned procs (“hostname”) never
> call MPI_Init. You can only use MPI_Comm_spawn to launch MPI programs,
> because the spawning parent will blissfully wait forever for the child
> processes to connect back during their MPI_Init.
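>
> A minimal MPI child along these lines would let the spawn complete (a
> sketch):
>
>     /* child.c -- build with: mpicc child.c -o child */
>     #include <mpi.h>
>     int main(int argc, char *argv[]) {
>         MPI_Comm parent;
>         MPI_Init(&argc, &argv);        /* connects back to the spawning parent */
>         MPI_Comm_get_parent(&parent);  /* intercomm for reporting results */
>         /* ... do work, send results to the parent over 'parent' ... */
>         MPI_Finalize();
>         return 0;
>     }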
>
>
> > On Jan 26, 2015, at 11:29 AM, Evan <evan.sama...@gmail.com> wrote:
> >
> > Hi,
> >
> > I am using Open MPI 1.8.4 on an Ubuntu 14.04 machine and five Ubuntu
> 12.04 machines.  I use ssh to launch MPI jobs, and I'm able to run simple
> programs like 'mpirun -np 8 --host localhost,pachy1 hostname' and get the
> expected output (pachy1 being an entry in my /etc/hosts file).
> >
> > I started using MPI_Comm_spawn in my app with the intent of NOT calling
> mpirun to launch the program that calls MPI_Comm_spawn (my attempt at using
> the singleton MPI_INIT pattern described in section 10.5.2 of the MPI 3.0
> standard).  The app needs to launch an MPI job of a given size from a given
> hostfile, and the job needs to report some info back to the app, so
> MPI_Comm_spawn seemed like my best bet.  The app will only rarely be used
> this way, which is why mpirun is not used to launch the app that acts as
> the parent in the MPI_Comm_spawn operation.  This pattern works fine if the
> only entries in the hostfile are 'localhost'.  However, if I add a host
> that isn't local, I get a segmentation fault from the orted process.
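> >
> > The pattern boils down to something like the following sketch (not my
> > attached master.c verbatim; "./worker" stands in for the real child
> > binary):
> >
> >     /* master: run directly as ./master <hostfile> -- no mpirun */
> >     #include <mpi.h>
> >     #include <stdio.h>
> >
> >     int main(int argc, char *argv[]) {
> >         MPI_Comm children;
> >         MPI_Info info;
> >
> >         if (argc < 2) {
> >             fprintf(stderr, "usage: %s <hostfile>\n", argv[0]);
> >             return 1;
> >         }
> >         MPI_Init(&argc, &argv);                  /* singleton init forks an HNP orted */
> >         MPI_Info_create(&info);
> >         MPI_Info_set(info, "hostfile", argv[1]); /* remote host here => segfault */
> >         MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 8, info, 0,
> >                        MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);
> >         MPI_Info_free(&info);
> >         /* ... children report their info back over the intercomm ... */
> >         MPI_Finalize();
> >         return 0;
> >     }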
> >
> > In any case, I distilled my example down as small as I could.  I've
> attached the C code of the master and the hostfile I'm using. Here's the
> output:
> >
> > evan@lasarti:~/devel/toy_progs/mpi_spawn$ ./master
> ~/mpi/test_distributed.hostfile
> > [lasarti:32020] [[21014,1],0] FORKING HNP: orted --hnp --set-sid
> --report-uri 14 --singleton-died-pipe 15 -mca state_novm_select 1 -mca
> ess_base_jobid 1377173504
> > [lasarti:32022] *** Process received signal ***
> > [lasarti:32022] Signal: Segmentation fault (11)
> > [lasarti:32022] Signal code: Address not mapped (1)
> > [lasarti:32022] Failing at address: (nil)
> > [lasarti:32022] [ 0]
> /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x7f07af039340]
> > [lasarti:32022] [ 1]
> /opt/openmpi-1.8.4/lib/libopen-pal.so.6(opal_hwloc191_hwloc_get_obj_by_depth+0x32)[0x7f07aea227c2]
> > [lasarti:32022] [ 2]
> /opt/openmpi-1.8.4/lib/libopen-pal.so.6(opal_hwloc_base_get_nbobjs_by_type+0x90)[0x7f07ae9f5430]
> > [lasarti:32022] [ 3]
> /opt/openmpi-1.8.4/lib/openmpi/mca_rmaps_round_robin.so(orte_rmaps_rr_byobj+0x134)[0x7f07ab2fb154]
> > [lasarti:32022] [ 4]
> /opt/openmpi-1.8.4/lib/openmpi/mca_rmaps_round_robin.so(+0x12c6)[0x7f07ab2fa2c6]
> > [lasarti:32022] [ 5]
> /opt/openmpi-1.8.4/lib/libopen-rte.so.7(orte_rmaps_base_map_job+0x21a)[0x7f07af299f7a]
> > [lasarti:32022] [ 6]
> /opt/openmpi-1.8.4/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0x6e4)[0x7f07ae9e7034]
> > [lasarti:32022] [ 7]
> /opt/openmpi-1.8.4/lib/libopen-rte.so.7(orte_daemon+0xdff)[0x7f07af27a86f]
> > [lasarti:32022] [ 8] orted(main+0x47)[0x400877]
> > [lasarti:32022] [ 9]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f07aec84ec5]
> > [lasarti:32022] [10] orted[0x4008cb]
> > [lasarti:32022] *** End of error message ***
> >
> > If I launch the binary built from 'master.c' using mpirun, I don't get a
> segmentation fault, but it doesn't seem to launch any processes anywhere
> other than localhost, no matter what hostfile I give it.
> >
> > For what it's worth, I fully expected to debug some path issues with the
> binary I wanted to launch via MPI_Comm_spawn once I ran this in a
> distributed setting, but at first glance this error doesn't appear to have
> anything to do with that.  I'm sure I'm doing something silly, but I don't
> really know how to debug this further given this error.
> >
> > Evan
> >
> > P.S. I'm only including the zipped config.log, since the "ompi_info -v
> ompi full --parsable" command I got from
> http://www.open-mpi.org/community/help/ doesn't seem to work anymore.
> >
> >
> >
> > <master.c> <test_distributed.hostfile> <config.log.tar.bz2>
