On Mar 27, 2009, at 11:22 AM, Gary Draving wrote:

Thanks for the advice, we tried "-mca btl_openib_ib_min_rnr_timer 25
-mca btl_openib_ib_timeout 20" but we are still getting errors as we
increase the Ns value in HPL.dat into the thousands. Is it ok to just
add these values to .openmpi/mca-params.conf for the user running the
test, or should we add these settings on each node in
/usr/local/etc/openmpi-mca-params.conf?


It would be better to put them in the /usr/local/... file so that all your users get those values without needing to do anything.
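
The file takes one "parameter = value" pair per line. As a minimal sketch (assuming /usr/local is your Open MPI install prefix, as in the path you mentioned), the two entries would look roughly like:

   # /usr/local/etc/openmpi-mca-params.conf
   # applies to every user and every job launched from this node
   btl_openib_ib_min_rnr_timer = 25
   btl_openib_ib_timeout = 20

The per-user ~/.openmpi/mca-params.conf uses the same syntax, but then each user has to maintain their own copy.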

The OpenFabrics stack has reported a network error event.  Open MPI
will try to continue, but your job may end up failing.

  Local host:        compute-0-8.local
  MPI process PID:   30544
  Error number:      10 (IBV_EVENT_PORT_ERR)

This error may indicate connectivity problems within the fabric;
please contact your system administrator.


This is different from your prior error -- it may indicate a problem with your IB fabric itself. As such, I think increasing the timer values fixed the retry-exceeded problem, but then this [new] error showed up.
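
For a rough sense of scale (back-of-the-envelope numbers from the ACK timeout formula quoted in the error text below, not measured values):

   default timeout 10:   4.096 usec * 2^10  =  ~4.2 milliseconds
   suggested timeout 20: 4.096 usec * 2^20  =  ~4.3 seconds

so raising btl_openib_ib_timeout from 10 to 20 gives each retry roughly 1000x longer to complete before the retry-exceeded error is reported.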




Ralph Castain wrote:
> The default retry values are wrong and will be corrected in the next
> OMPI release. For now, try running with:
>
> -mca btl_openib_ib_min_rnr_timer 25 -mca btl_openib_ib_timeout 20
>
> Should work.
> Ralph
>
> On Mar 26, 2009, at 2:16 PM, Gary Draving wrote:
>
>> Hi Everyone,
>>
>> I'm doing some performance testing using HPL with TCP turned off. My
>> HPL.dat file is included at the end of this message.
>> It seems to work well for lower Ns values, but as I increase that
>> value it inevitably fails with
>> "[[13535,1],169][btl_openib_component.c:2905:handle_wc] from
>> compute-0-0.local to: compute-0-8 error polling LP CQ with status
>> RETRY EXCEEDED ERROR status number 12 for wr_id 209907960 opcode 0
>> qp_idx 3"
>>
>> The machines in this test are all dual-core quads with built-in
>> Mellanox cards, for a total of 320 processors.
>>
>> It seems like once I reach a certain number of "Ns" the InfiniBand
>> starts having problems.  Increasing "btl_openib_ib_retry_count" and
>> "btl_openib_ib_timeout" to the max allowed us to get from 4096 to
>> 8192 Ns, but the error came back at around 8192.
>>
>> If anyone has some ideas on this problem I would be very interested.
>> Thanks
>>
>> (((((((((((((((((( HPL.dat file being used ))))))))))))))))))
>>
>> HPLinpack benchmark input file
>> Innovative Computing Laboratory, University of Tennessee
>> HPL.out      output file name (if any)
>> 6            device out (6=stdout,7=stderr,file)
>> 1            # of problems sizes (N)
>> 8192        Ns
>> 1            # of NBs
>> 256          NBs
>> 0            PMAP process mapping (0=Row-,1=Column-major)
>> 1            # of process grids (P x Q)
>> 19           Ps
>> 19           Qs
>> (defaults for rest)
>>
>> (((((((((((((((((( Full error printout ))))))))))))))))))
>>
>> [[13535,1],169][btl_openib_component.c:2905:handle_wc] from
>> compute-0-0.local to: compute-0-8 error polling LP CQ with status
>> RETRY EXCEEDED ERROR status number 12 for wr_id 209907960 opcode 0
>> qp_idx 3
>> --------------------------------------------------------------------------
>>
>> The InfiniBand retry count between two MPI processes has been
>> exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
>> (section 12.7.38):
>>
>>   The total number of times that the sender wishes the receiver to
>>   retry timeout, packet sequence, etc. errors before posting a
>>   completion error.
>>
>> This error typically means that there is something awry within the
>> InfiniBand fabric itself.  You should note the hosts on which this
>> error has occurred; it has been observed that rebooting or removing a
>> particular host from the job can sometimes resolve this issue.
>> Two MCA parameters can be used to control Open MPI's behavior with
>> respect to the retry count:
>>
>> * btl_openib_ib_retry_count - The number of times the sender will
>> attempt to retry (defaulted to 7, the maximum value).
>> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
>> to 10).  The actual timeout value used is calculated as:
>>
>>    4.096 microseconds * (2^btl_openib_ib_timeout)
>>
>> See the InfiniBand spec 1.2 (section 12.7.34) for more details.
>>
>> Below is some information about the host that raised the error and the
>> peer to which it was connected:
>>
>> Local host:   compute-0-0.local
>> Local device: mthca0
>> Peer host:    compute-0-8
>>
>> You may need to consult with your system administrator to get this
>> problem fixed.
>> --------------------------------------------------------------------------
>>
>> --------------------------------------------------------------------------
>>
>> mpirun has exited due to process rank 169 with PID 26725 on
>> node compute-0-0 exiting without calling "finalize". This may
>> have caused other processes in the application to be
>> terminated by signals sent by mpirun (as reported here).
>> --------------------------------------------------------------------------
>>

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Jeff Squyres
Cisco Systems
