Would it be possible to get a backtrace from one of the crashes? It would be especially helpful if you could rebuild OMPI with --enable-debug added to the configure options first.
On Wed, Apr 1, 2015 at 1:09 PM, Thomas Klimpel <jacques.gent...@gmail.com> wrote:
> > You might double-check by running with "--mca btl ^openib" to see if
> > that is the source of the warning
>
> The warning appears always, independent of the interconnect, and even when
> running with "--mca btl ^openib".
>
> > Does it only crash when you pause it? Or does it crash while normally
> > running?
>
> It is very hard to reproduce without pause. It only crashes 1 out of 5
> after half an hour for a run which would take 36 hours. Smaller test cases
> seem to never crash on their own, but when I pause, even quite small test
> cases (less than a minute) crash, if I have more than 72 workers.