Re: [OMPI users] runtime error in orte/loop_spawn test using OMPI 1.10.2

2016-06-15 Thread Jason Maldonis
Hi Gilles, I would like to be able to run on anywhere from 1-16 nodes. Let me explain our (MPI/parallelism) situation briefly for more context: We have a "master" job that needs MPI functionality. This master job is written in Python (we use mpi4py). The master job then makes spawn calls out to

Re: [OMPI users] runtime error in orte/loop_spawn test using OMPI 1.10.2

2016-06-15 Thread Gilles Gouaillardet
Jason, How many nodes are you running on? Since you have an IB network, IB is used for intra-node communication between tasks that are not part of the same OpenMPI job (read: spawn group). I can make a simple patch to use TCP instead of IB for this intra-node communication. Let me know if you are

Re: [OMPI users] runtime error in orte/loop_spawn test using OMPI 1.10.2

2016-06-14 Thread Jason Maldonis
Thanks Ralph for all the help. I will do that until it gets fixed. Nathan, I am very very interested in this working because we are developing some new cool code for research in materials science. This is the last piece of the puzzle for us I believe. I can use TCP for now though of course. While

Re: [OMPI users] runtime error in orte/loop_spawn test using OMPI 1.10.2

2016-06-14 Thread Ralph Castain
You don’t want to always use those options, as your performance will take a hit - TCP instead of InfiniBand isn’t a good trade. Sadly, this is something we need someone like Nathan to address, as it is a bug in the code base, and in an area I’m not familiar with. For now, just use TCP so you can move for

Re: [OMPI users] runtime error in orte/loop_spawn test using OMPI 1.10.2

2016-06-14 Thread Jason Maldonis
Ralph, The problem *does* go away if I add "-mca btl tcp,sm,self" to the mpiexec cmd line. (By the way, I am using mpiexec rather than mpirun; do you recommend one over the other?) Will you tell me what this means for me? For example, should I always append these arguments to mpiexec for my non-tes
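
The workaround discussed in this message can be sketched as a small Python helper, since the master job in this thread is a Python program. This is only an illustration, not code from the thread: the helper names `build_mpiexec_cmd` and `tcp_only_env` and the script name `master.py` are hypothetical. It assumes the standard Open MPI convention that an MCA parameter can be set either with `-mca <name> <value>` on the command line or through an `OMPI_MCA_<name>` environment variable.

```python
import os

# BTL list quoted in the thread: TCP, shared memory, and process loopback.
TCP_ONLY_BTL = "tcp,sm,self"


def build_mpiexec_cmd(program, nprocs, force_tcp=True):
    """Return an mpiexec argument list, optionally restricting the BTLs.

    `program` is the list of arguments to launch (hypothetical helper).
    """
    cmd = ["mpiexec", "-n", str(nprocs)]
    if force_tcp:
        # Same option as in the thread: -mca btl tcp,sm,self
        cmd += ["-mca", "btl", TCP_ONLY_BTL]
    return cmd + list(program)


def tcp_only_env(base_env=None):
    """Return a copy of the environment with the same BTL selection set.

    Assumption: Open MPI reads MCA parameters from OMPI_MCA_<name>.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMPI_MCA_btl"] = TCP_ONLY_BTL
    return env


if __name__ == "__main__":
    # "master.py" is a placeholder script name, not taken from the thread.
    print(" ".join(build_mpiexec_cmd(["python", "master.py"], nprocs=1)))
```

Keeping the restriction behind a `force_tcp` switch makes it easy to drop once the underlying bug is fixed, which matters because TCP is expected to be slower than InfiniBand.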

Re: [OMPI users] runtime error in orte/loop_spawn test using OMPI 1.10.2

2016-06-14 Thread Nathan Hjelm
That message is coming from udcm in the openib btl. It indicates some sort of failure in the connection mechanism. It can happen if the listening thread no longer exists or is taking too long to process messages. -Nathan On Jun 14, 2016, at 12:20 PM, Ralph Castain wrote: Hmm…I’m unable to r

Re: [OMPI users] runtime error in orte/loop_spawn test using OMPI 1.10.2

2016-06-14 Thread Ralph Castain
Hmm…I’m unable to replicate a problem on my machines. What fabric are you using? Does the problem go away if you add “-mca btl tcp,sm,self” to the mpirun cmd line? > On Jun 14, 2016, at 11:15 AM, Jason Maldonis wrote: > > Hi Ralph, et. al, > > Great, thank you for the help. I downloaded the m

Re: [OMPI users] runtime error in orte/loop_spawn test using OMPI 1.10.2

2016-06-14 Thread Jason Maldonis
Hi Ralph, et al., Great, thank you for the help. I downloaded the MPI loop spawn test directly from what I think is the master repo on GitHub: https://github.com/open-mpi/ompi/blob/master/orte/test/mpi/loop_spawn.c I am still using the MPI code from 1.10.2, however. Is that test updated with the

Re: [OMPI users] runtime error in orte/loop_spawn test using OMPI 1.10.2

2016-06-14 Thread Ralph Castain
I dug into this a bit (with some help from others) and found that the spawn code appears to be working correctly - it is the test in orte/test that is wrong. The test has been correctly updated in the 2.x and master repos, but we failed to backport it to the 1.10 series. I have done so this morn

Re: [OMPI users] runtime error in orte/loop_spawn test using OMPI 1.10.2

2016-06-13 Thread Ralph Castain
No, that PR has nothing to do with loop_spawn. I’ll try to take a look at the problem. > On Jun 13, 2016, at 3:47 PM, Jason Maldonis wrote: > > Hello, > > I am using OpenMPI 1.10.2 compiled with Intel. I am trying to get the spawn > functionality to work inside a for loop, but continue to get

[OMPI users] runtime error in orte/loop_spawn test using OMPI 1.10.2

2016-06-13 Thread Jason Maldonis
Hello, I am using OpenMPI 1.10.2 compiled with Intel. I am trying to get the spawn functionality to work inside a for loop, but continue to get the error "too many retries sending message to , giving up" somewhere down the line in the for loop, seemingly because the processors are not being fully
