Re: [OMPI users] orted seg fault when using MPI_Comm_spawn on more than one host

2015-02-04 Thread Evan Samanas
Indeed, I simply commented out all the MPI_Info stuff, which you
essentially did by passing a dummy argument.  I'm still not able to get it
to succeed.

So here we go: my results defy logic.  I'm sure this could be my
fault... I've only been an occasional user of Open MPI (and MPI in
general) over the years, and I've never used MPI_Comm_spawn before this
project.  I tested simple_spawn like so:
mpicc simple_spawn.c -o simple_spawn
./simple_spawn

When my default hostfile points to a file that just lists localhost, this
test completes successfully.  If it points to my hostfile with localhost
and 5 remote hosts, here's the output:
evan@lasarti:~/devel/toy_progs/mpi_spawn$ mpicc simple_spawn.c -o simple_spawn
evan@lasarti:~/devel/toy_progs/mpi_spawn$ ./simple_spawn
[pid 5703] starting up!
0 completed MPI_Init
Parent [pid 5703] about to spawn!
[lasarti:05703] [[14661,1],0] FORKING HNP: orted --hnp --set-sid --report-uri 14 --singleton-died-pipe 15 -mca state_novm_select 1 -mca ess_base_jobid 960823296
[lasarti:05705] *** Process received signal ***
[lasarti:05705] Signal: Segmentation fault (11)
[lasarti:05705] Signal code: Address not mapped (1)
[lasarti:05705] Failing at address: (nil)
[lasarti:05705] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x7fc185dcf340]
[lasarti:05705] [ 1] /opt/openmpi-v1.8.4-54-g07f735a/lib/libopen-rte.so.7(orte_rmaps_base_compute_bindings+0x650)[0x7fc186033bb0]
[lasarti:05705] [ 2] /opt/openmpi-v1.8.4-54-g07f735a/lib/libopen-rte.so.7(orte_rmaps_base_map_job+0x939)[0x7fc18602fb99]
[lasarti:05705] [ 3] /opt/openmpi-v1.8.4-54-g07f735a/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0x6e4)[0x7fc18577dcc4]
[lasarti:05705] [ 4] /opt/openmpi-v1.8.4-54-g07f735a/lib/libopen-rte.so.7(orte_daemon+0xdf8)[0x7fc186010438]
[lasarti:05705] [ 5] orted(main+0x47)[0x400887]
[lasarti:05705] [ 6] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7fc185a1aec5]
[lasarti:05705] [ 7] orted[0x4008db]
[lasarti:05705] *** End of error message ***

You can see from the message that this particular run IS from the latest
snapshot, though the failure happens on v1.8.4 as well.  I didn't bother
installing the snapshot on the remote nodes, though.  Should I do that?  It
looked to me like this error happened well before we got to a remote node,
which is why I didn't.

Your thoughts?

Evan



On Tue, Feb 3, 2015 at 7:40 PM, Ralph Castain  wrote:

> I confess I am sorely puzzled. I replaced the Info key with MPI_INFO_NULL,
> but still had to pass a bogus argument to master since you still have the
> Info_set code in there - otherwise, MPI_Info_set segfaults due to a NULL
> argv[1]. Doing that (and replacing "hostname" with an MPI example code)
> makes everything work just fine.
>
> I've attached one of our example comm_spawn codes that we test against -
> it also works fine with the current head of the 1.8 code base. I confess
> that some changes have been made since 1.8.4 was released, and it is
> entirely possible that this was a problem in 1.8.4 and has since been fixed.
>
> So I'd suggest trying with the nightly 1.8 tarball and seeing if it works
> for you. You can download it from here:
>
> http://www.open-mpi.org/nightly/v1.8/
>
> HTH
> Ralph
>
>
> On Tue, Feb 3, 2015 at 6:20 PM, Evan Samanas 
> wrote:
>
>> Yes, I did.  I replaced the info argument of MPI_Comm_spawn with
>> MPI_INFO_NULL.
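>>
>> For reference, the call now looks roughly like this (a minimal sketch;
>> the child program name, spawn count, and "intercomm" are placeholders,
>> not my actual code):
>>
>>   MPI_Comm intercomm;
>>   /* spawn 5 copies of a child binary with no extra args and no Info keys */
>>   MPI_Comm_spawn("./child", MPI_ARGV_NULL, 5, MPI_INFO_NULL,
>>                  0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);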
>>
>> On Tue, Feb 3, 2015 at 5:54 PM, Ralph Castain  wrote:
>>
>>> When running your comm_spawn code, did you remove the Info key code? You
>>> wouldn't need to provide a hostfile or hosts any more, which is why it
>>> should resolve that problem.
>>>
>>> I agree that providing either hostfile or host as an Info key will cause
> the program to segfault - I'm working on that issue.
>>>
>>>
>>> On Tue, Feb 3, 2015 at 3:46 PM, Evan Samanas 
>>> wrote:
>>>
 Setting these environment variables did indeed change the way mpirun
 maps things, and I didn't have to specify a hostfile.  However, setting
 these for my MPI_Comm_spawn code still resulted in the same segmentation
 fault.

 Evan

 On Tue, Feb 3, 2015 at 10:09 AM, Ralph Castain 
 wrote:

> If you add the following to your environment, you should run on
> multiple nodes:
>
> OMPI_MCA_rmaps_base_mapping_policy=node
> OMPI_MCA_orte_default_hostfile=
>
> The first tells OMPI to map-by node. The second passes in your default
> hostfile so you don't need to specify it as an Info key.
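>
> For example, in a bash shell (a minimal sketch; the hostfile path and
> test binary names are hypothetical):
>
>   export OMPI_MCA_rmaps_base_mapping_policy=node
>   export OMPI_MCA_orte_default_hostfile=$HOME/hosts.txt  # your own hostfile
>   ./spawn_test   # no hostfile/host Info key needed in the code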
>
> HTH
> Ralph
>
>
> On Tue, Feb 3, 2015 at 9:23 AM, Evan Samanas 
> wrote:
>
>> Hi Ralph,
>>
>> Good to know you've reproduced it.  I was experiencing this using
>> both the hostfile and host key.  A simple comm_spawn was working for me
>> as well, but it was only launching locally, and I'm pretty sure each node
>> only

Re: [OMPI users] prob in running two mpi merged program (UNCLASSIFIED)

2015-02-04 Thread Muhammad Ashfaqur Rahman
Dear Andrew Burns,
Thank you for your ideas. Your guess is partly correct: I am trying to
merge two sets of programs into one executable and then run it under MPI.
As per your suggestion, I have omitted MPI_Finalize from one set,
and I have also commented out the MPI_Barrier calls in some parts.
But it still runs serially.
For your reference, the Makefile is attached.

Regards
Ashfaq


On Tue, Feb 3, 2015 at 6:26 PM, Burns, Andrew J CTR (US) <
andrew.j.burns35@mail.mil> wrote:

> Classification: UNCLASSIFIED
> Caveats: NONE
>
> If I could venture a guess, it sounds like you are trying to merge two
> separate programs into one executable and run them in parallel
> via MPI.
>
> The problem sounds like an issue where your program starts in parallel but
> then changes back to serial while the program is still
> executing.
>
> I can't be entirely sure without looking at the code itself.
>
> One guess is that MPI_Finalize is in the wrong location. Finalize should
> be called to end the parallel section and move the program
> back to serial. Typically this means that Finalize will be very close to
> the last line of the program.
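>
> For illustration, a minimal sketch of that structure (hypothetical code,
> not your actual program):
>
>   #include <mpi.h>
>   #include <stdio.h>
>
>   int main(int argc, char **argv)
>   {
>       int rank;
>       MPI_Init(&argc, &argv);                /* enter the parallel section */
>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>       printf("hello from rank %d\n", rank);  /* parallel work happens here */
>       MPI_Finalize();                        /* last MPI call, near the end */
>       return 0;
>   }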
>
> It may also be possible that, with the way your program is structured,
> execution is effectively serial since only one core is
> processing at any given moment. This may be due to extensive use of
> barrier or similar functions.
>
> Andrew Burns
> Lockheed Martin
> Software Engineer
> 410-306-0409
> ARL DSRC
> andrew.j.bur...@us.army.mil
> andrew.j.burns35@mail.mil
>
> -Original Message-
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Tuesday, February 03, 2015 9:05 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] prob in running two mpi merged program
>
> I'm afraid I don't quite understand what you are saying, so let's see if I
> can clarify. You have two fortran MPI programs. You start
> one using "mpiexec". You then start the other one as a singleton - i.e.,
> you just run "myapp" without using mpiexec. The two apps are
> attempting to execute an MPI_Connect/accept so they can "join".
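>
> For reference, the usual join pattern is MPI_Open_port/MPI_Comm_accept on
> one side and MPI_Comm_connect on the other (a minimal sketch; "inter" is a
> placeholder variable, not your code):
>
>   char port[MPI_MAX_PORT_NAME];
>   MPI_Comm inter;
>   /* accepting side */
>   MPI_Open_port(MPI_INFO_NULL, port);    /* share 'port' out of band */
>   MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
>   /* connecting side, using the same port string */
>   MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);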
>
> Is that correct? You mention MPICH in your statement about one of the
> procs - are you using MPICH or Open MPI? If the latter, which
> version are you using?
>
> Ralph
>
>
> On Mon, Feb 2, 2015 at 11:35 PM, Muhammad Ashfaqur Rahman <
> ashfaq...@gmail.com> wrote:
>
>
> Dear All,
> Greetings. I am new to MPI. I have problems with a
> parallel run when two Fortran MPI programs are merged into one
> executable. If the two are kept separate, they run in parallel fine.
>
> One program uses SPMD and the other one uses the MPICH header
> directly.
>
> The other issue is that when I try to run the above-mentioned merged
> program under MPI, it first starts with separate parallel
> instances of the same step, and then after some steps it becomes serial.
>
> Please help me in this regard.
>
> Ashfaq
> Ph.D Student
> Dept. of Meteorology
>
> Classification: UNCLASSIFIED
> Caveats: NONE


Makefile
Description: Binary data


Re: [OMPI users] independent startup of orted and orterun

2015-02-04 Thread Ralph Castain
We're going to take this off-list so we quit peppering you all with the
development details... we'll report back when we have something more
concrete, should anyone else be interested.



On Wed, Feb 4, 2015 at 2:22 AM, Mark Santcroos 
wrote:

> Ok great, sounds like a plan!
>
> > On 04 Feb 2015, at 2:53 , Ralph Castain  wrote:
> >
> > Appreciate your patience! I'm somewhat limited this week by being on
> travel to our HQ, so I don't have access to my usual test cluster. I'll be
> better situated to complete the implementation once I get home.
> >
> > For now, some quick thoughts:
> >
> > 1. stdout/stderr: yes, I just need to "register" orte-submit as the one
> to receive those from the submitted job.
> >
> > 2. That one is going to be a tad trickier, but is resolvable. May take
> me a little longer to fix.
> >
> > 3. dang - I thought I had it doing so. I'll look to find the issue. I
> suspect it's just a case of correctly setting the return code of
> orte-submit.
> >
> > I'd welcome the help! Let me ponder the best way to point you to the
> areas needing work, and we can kick around off-list about who does what.
> >
> > Great to hear this is working with your tool so quickly!!
> > Ralph
> >
> >
> > On Tue, Feb 3, 2015 at 3:49 PM, Mark Santcroos <
> mark.santcr...@rutgers.edu> wrote:
> > Hi Ralph,
> >
> > Besides the items in the other mail, I have three more items that would
> need resolving at some point.
> >
> > 1. STDOUT/STDERR currently go to the orte-dvm console.
> >I'm sure this is not a fundamental limitation.
> >Even if getting the information to the orte-submit instance would be
> problematic, the orte-dvm writing this to a file per session would be good
> enough too.
> >
> > 2. Failing applications currently tear down the dvm.
> >Ideally that would not be the case, and this would be handled in
> relation to item (3).
> >Possibly this needs to be configurable, if others would like to see
> different behaviour.
> >
> > 3. orte-submit doesn't return the exit code of the application.
> >
> > To be clear, I realise the current implementation is a proof of concept,
> so these are no complaints, just wishes of where I hope to see this going!
> >
> > FWIW: these items might require less intricate knowledge of OMPI in
> general, so with some pointers/guidance I can probably work on these myself
> if needed.
> >
> > Cheers,
> >
> > Mark
> >
> > ps. I did a quick-and-dirty integration with our own tool and the ORTE
> abstraction maps like a charm!
> > (
> https://github.com/radical-cybertools/radical.pilot/commit/2d36e886081bf8531097edfc95ada1826257e460
> )
> >
> > > On 03 Feb 2015, at 20:38 , Mark Santcroos 
> wrote:
> > >
> > > Hi Ralph,
> > >
> > >> On 03 Feb 2015, at 16:28 , Ralph Castain  wrote:
> > >> I think I fixed some of the handshake issues - please give it another
> try.
> > >> You should see orte-submit properly shutdown upon completion,
> > >
> > > Indeed, it works on my laptop now! Great!
> > > It feels quite fast too, for short tasks :-)
> > >
> > >> and orte-dvm properly shutdown when sent the terminate cmd.
> > >
> > > ACK. This also works as expected.
> > >
> > >> I was able to cleanly run MPI jobs on my laptop.
> > >
> > > Do you also see the following errors/warnings on the dvm side?
> > >
> > > [netbook:28324] [[20896,0],0] Releasing job data for [INVALID]
> > > Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI
> > > mark@netbook Distribution, ident: 1.9.0a1, repo rev: dev-811-g7299cc3,
> > > Unreleased developer copy, 132)
> > > [netbook:28324] sess_dir_finalize: proc session dir does not exist
> > > [netbook:28324] [[20896,0],0] dvm: job [20896,20] has completed
> > > [netbook:28324] [[20896,0],0] Releasing job data for [20896,20]
> > >
> > > The "INVALID" message is there for every "submit", the
> sess_dir_finalize exists per instance/core.
> > > Is that something to worry about, that needs fixing or is that a
> configuration issue?
> > >
> > > I haven't been able to test on Edison because of maintenance
> (today+tomorrow), so I will report on that later.
> > >
> > > Thanks again!
> > >
> > > Mark
> >


Re: [OMPI users] independent startup of orted and orterun

2015-02-04 Thread Mark Santcroos
Ok great, sounds like a plan!
