Re: [OMPI users] [petsc-maint] Deadlock in OpenMPI 1.8.3 and PETSc 3.4.5

2015-02-23 Thread Jed Brown
"Jeff Squyres (jsquyres)" writes: > This is, unfortunately, an undefined area of the MPI specification. I > do believe that our previous behavior was *correct* -- it just > deadlocks with PETSC because PETSC is relying on undefined behavior. Jeff, can you clarify where in the standard this is le

Re: [OMPI users] Several Bcast calls in a loop causing the code to hang

2015-02-23 Thread George Bosilca
On Mon, Feb 23, 2015 at 4:37 PM, Joshua Ladd wrote: > Nathan, > > I do, but the hang comes later on. It looks like it's a situation where > the root is way, way faster than the children and he's inducing an > overrun in the unexpected message queue. I think the queue is set to just > keep grow

Re: [OMPI users] Several Bcast calls in a loop causing the code to hang

2015-02-23 Thread Joshua Ladd
Nathan, I do, but the hang comes later on. It looks like it's a situation where the root is way, way faster than the children and he's inducing an overrun in the unexpected message queue. I think the queue is set to just keep growing and it eventually blows up the memory?? $/hpc/mtl_scrap/user
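[A minimal sketch of the pattern being described; the buffer size, iteration count, and the usleep() delay are illustrative assumptions, not taken from the reported test case. The root races through repeated MPI_Bcast calls while the other ranks lag behind, so eagerly-sent broadcast fragments can pile up in the receivers' unexpected-message queues.]

    /* bcast_loop.c -- illustrative only; sizes and delays are assumptions */
    #include <mpi.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        double buf[1024] = {0};
        int rank, i;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (i = 0; i < 100000; i++) {
            if (rank != 0)
                usleep(1000);   /* non-root ranks fall behind the root */
            MPI_Bcast(buf, 1024, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }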

Re: [OMPI users] Several Bcast calls in a loop causing the code to hang

2015-02-23 Thread Nathan Hjelm
Josh, do you see a hang when using vader? It is preferred over the old sm btl. -Nathan On Mon, Feb 23, 2015 at 03:48:17PM -0500, Joshua Ladd wrote: >Sachin, > >I am able to reproduce something funny. Looks like your issue. When I run >on a single host with two ranks, the test works
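[To test this suggestion, the shared-memory transport can be pinned on the command line; for example, assuming a single host and a placeholder executable name and rank count:]

    mpirun -np 3 --mca btl vader,self ./bcast_loop
    mpirun -np 3 --mca btl sm,self ./bcast_loop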

Re: [OMPI users] Several Bcast calls in a loop causing the code to hang

2015-02-23 Thread Joshua Ladd
Sachin, I am able to reproduce something funny. Looks like your issue. When I run on a single host with two ranks, the test works fine. However, when I try three or more, it looks like only the root, rank 0, is making any progress after the first iteration. $/hpc/mtl_scrap/users/joshual/openmpi-1

Re: [OMPI users] Slave machine shutdown

2015-02-23 Thread Ralph Castain
What version of OMPI are you using, and how was it configured? How was the job started? > On Feb 23, 2015, at 8:26 AM, Aleix Gimeno Vives wrote: > > Hello Ralph, > > The job is still running though and I used the default options. Would you > recommend that I run the job again? (the job will t

Re: [OMPI users] Slave machine shutdown

2015-02-23 Thread Aleix Gimeno Vives
Hello Ralph, The job is still running though and I used the default options. Would you recommend that I run the job again? (the job will take several days, so I'd rather not run it again if possible). Regards, Aleix 2015-02-23 17:20 GMT+01:00 Ralph Castain : > I would have expected the job to a

Re: [OMPI users] Slave machine shutdown

2015-02-23 Thread Ralph Castain
I would have expected the job to automatically abort if any processes were located on the slave that shut down - that is the default behavior. > On Feb 23, 2015, at 8:07 AM, Aleix Gimeno Vives wrote: > > Dear Open MPI support team, > > I am running a program using 1 master machine and 4 slave

[OMPI users] Slave machine shutdown

2015-02-23 Thread Aleix Gimeno Vives
Dear Open MPI support team, I am running a program using 1 master machine and 4 slaves, but one of the slaves was shut down. Will this have any influence on the output? Should I restart the job? I know it is a simple question, but I couldn't find the answer in the "Open MPI FAQ" or the mailing

[OMPI users] Questions regarding MPI intercommunicators & collectives

2015-02-23 Thread Harald Servat
Hello list, we have several questions regarding calls to collectives using intercommunicators. The man page for MPI_Bcast contains a note for the inter-communicator case; it is reproduced as the text below our questions. If I is an intercommunicator joining communicators C1={p1,p2,p3} and C2={p4,p5,p6
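[For reference, a hedged sketch (not the poster's code) of how the MPI standard's MPI_ROOT / MPI_PROC_NULL convention applies to MPI_Bcast on an intercommunicator: rank 0 of the first group broadcasts to every process of the second group. The split and leader choices are arbitrary assumptions made to keep the example self-contained; run with at least 2 ranks.]

    /* inter_bcast.c -- illustrative intercommunicator broadcast */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int wrank, color, value = 0;
        MPI_Comm intra, inter;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &wrank);

        color = wrank % 2;                       /* two groups: even / odd world ranks */
        MPI_Comm_split(MPI_COMM_WORLD, color, wrank, &intra);

        /* Local leader is local rank 0; remote leader is world rank 1 (for the
         * even group) or world rank 0 (for the odd group). */
        MPI_Intercomm_create(intra, 0, MPI_COMM_WORLD, color == 0 ? 1 : 0, 99, &inter);

        if (color == 0) {                        /* "C1": the group holding the root */
            int lrank;
            MPI_Comm_rank(intra, &lrank);
            value = 42;
            /* The root passes MPI_ROOT; its peers in the same group pass MPI_PROC_NULL. */
            MPI_Bcast(&value, 1, MPI_INT, lrank == 0 ? MPI_ROOT : MPI_PROC_NULL, inter);
        } else {                                 /* "C2": every process receives */
            /* Receivers pass the root's rank within the remote group (0 here). */
            MPI_Bcast(&value, 1, MPI_INT, 0, inter);
            printf("world rank %d received %d\n", wrank, value);
        }

        MPI_Comm_free(&inter);
        MPI_Comm_free(&intra);
        MPI_Finalize();
        return 0;
    }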