What version of OMPI are you using? The job should terminate in either case; what did you do to keep it running after a node failure with tcp?
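For what it's worth: the MPI default is MPI_ERRORS_ARE_FATAL, where any error aborts the whole job, so a tcp job that survives a dead node has usually installed an MPI_ERRORS_RETURN error handler. A minimal sketch of that mechanism (the ring exchange is a made-up placeholder, not anything from your job):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, rc;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Default is MPI_ERRORS_ARE_FATAL: any error aborts the whole
         * job.  With MPI_ERRORS_RETURN, failed calls instead return an
         * error code and the application decides what to do next. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        /* Made-up ring exchange, just to have a call that can fail
         * when a peer dies. */
        int buf = rank;
        rc = MPI_Sendrecv_replace(&buf, 1, MPI_INT,
                                  (rank + 1) % size, 0,        /* dest */
                                  (rank + size - 1) % size, 0, /* src  */
                                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (rc != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING];
            int len;
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "rank %d: exchange failed: %s\n", rank, msg);
            /* Application-level recovery would go here. */
        }

        MPI_Finalize();
        return 0;
    }

Note that even with MPI_ERRORS_RETURN the standard leaves the state of MPI undefined after an error, so whether the job remains usable is implementation-dependent.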
On Sep 23, 2011, at 12:34 PM, Guilherme V wrote:

> Hi,
> I want to know if anybody is having problems with fault-tolerant jobs using
> InfiniBand. When I run my job with tcp and anything happens to one node, my
> job keeps running, but if I change my job to use InfiniBand and anything
> happens to the InfiniBand (i.e. cable problems), my job fails.
>
> Does anybody know if there is something different that needs to be done when
> using openib instead of tcp?
>
> Below is an example of the message I'm receiving from MPI.
>
> Regards,
> Guilherme
>
> --------------------------------------------------------------------------
> The OpenFabrics stack has reported a network error event. Open MPI
> will try to continue, but your job may end up failing.
>
> Local host: XXXXX
> MPI process PID: 23341
> Error number: 10 (IBV_EVENT_PORT_ERR)
>
> This error may indicate connectivity problems within the fabric;
> please contact your system administrator.
> --------------------------------------------------------------------------
> [ZZZZ:23320] 15 more processes have sent help message
> help-mpi-btl-openib.txt / of error event
> [WWW:23320] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> all help / error messages
> [[4089,1],144][btl_openib_component.c:3227:handle_wc] from XXXXX to: YYYYY
> error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for
> wr_id 214283560 opcode 51 vendor error 129 qp_idx 3
> [[4089,1],147][btl_openib_component.c:3227:handle_wc] from XXXXX to: YYYYY
> error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for
> wr_id 490884096 opcode 1 vendor error 129 qp_idx 0
> --------------------------------------------------------------------------
> The InfiniBand retry count between two MPI processes has been
> exceeded. "Retry count" is defined in the InfiniBand spec 1.2
> (section 12.7.38):
>
>     The total number of times that the sender wishes the receiver to
>     retry timeout, packet sequence, etc. errors before posting a
>     completion error.
>
> This error typically means that there is something awry within the
> InfiniBand fabric itself. You should note the hosts on which this
> error has occurred; it has been observed that rebooting or removing a
> particular host from the job can sometimes resolve this issue.
>
> Two MCA parameters can be used to control Open MPI's behavior with
> respect to the retry count:
>
> * btl_openib_ib_retry_count - The number of times the sender will
>   attempt to retry (defaulted to 7, the maximum value).
> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
>   to 10). The actual timeout value used is calculated as:
>
>     4.096 microseconds * (2^btl_openib_ib_timeout)
>
> See the InfiniBand spec 1.2 (section 12.7.34) for more details.
>
> Below is some information about the host that raised the error and the
> peer to which it was connected:
>
> Local host: XXXX
> Local device: mlx4_0
> Peer host: YYYY
>
> You may need to consult with your system administrator to get this
> problem fixed.
> --------------------------------------------------------------------------
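As an aside, the ACK-timeout formula in that quoted help text is easy to sanity-check. A minimal sketch of the arithmetic, assuming only the defaults quoted above (retry count 7, timeout 10); the worst-case line is my own rough estimate, not something Open MPI prints:

    #include <stdio.h>

    int main(void)
    {
        /* Defaults quoted in the help text above. */
        const int retry_count = 7;   /* btl_openib_ib_retry_count */
        const int timeout     = 10;  /* btl_openib_ib_timeout     */

        /* Formula from the help text:
         * ACK timeout = 4.096 us * 2^btl_openib_ib_timeout */
        double ack_timeout_us = 4.096 * (double)(1 << timeout);

        printf("ACK timeout : %.3f us (~%.2f ms)\n",
               ack_timeout_us, ack_timeout_us / 1000.0);

        /* Rough worst case before RETRY EXCEEDED fires: about
         * retry_count consecutive timeouts in a row. */
        printf("worst case  : ~%.1f ms\n",
               retry_count * ack_timeout_us / 1000.0);
        return 0;
    }

With the defaults that works out to roughly 4.2 ms per attempt and about 29 ms of retrying in total, so a port that goes down (IBV_EVENT_PORT_ERR) will exhaust the retry budget almost immediately, which is consistent with the RETRY EXCEEDED errors quoted above.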