On Mar 27, 2009, at 11:22 AM, Gary Draving wrote:

Thanks for the advice, we tried "-mca btl_openib_ib_min_rnr_timer 25
-mca btl_openib_ib_timeout 20" but we are still getting errors as we
increase the Ns value in HPL.dat into the thousands. Is it ok to just
add these values to .openmpi/mca-params.conf for the user running the
test, or should we add these settings on each node in
/usr/local/etc/openmpi-mca-params.conf?


It would be better to put them in the /usr/local/... file so that all your users get those values without needing to do anything.
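
The file takes one "parameter = value" pair per line. As a minimal sketch (assuming /usr/local is your Open MPI install prefix, as in the path you mentioned), the two entries would look roughly like:

   # /usr/local/etc/openmpi-mca-params.conf
   # applies to every user and every job launched from this node
   btl_openib_ib_min_rnr_timer = 25
   btl_openib_ib_timeout = 20

The per-user ~/.openmpi/mca-params.conf uses the same syntax, but then each user has to maintain their own copy.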

The OpenFabrics stack has reported a network error event.  Open MPI
will try to continue, but your job may end up failing.

  Local host:        compute-0-8.local
  MPI process PID:   30544
  Error number:      10 (IBV_EVENT_PORT_ERR)

This error may indicate connectivity problems within the fabric;
please contact your system administrator.


This is different from your prior error -- it may indicate a problem with your IB fabric itself. As such, I think increasing the timer values fixed the retry-exceeded problem, but then this [new] error showed up.
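
For a rough sense of scale (back-of-the-envelope numbers from the ACK timeout formula quoted in the error text below, not measured values):

   default timeout 10:   4.096 usec * 2^10  =  ~4.2 milliseconds
   suggested timeout 20: 4.096 usec * 2^20  =  ~4.3 seconds

so raising btl_openib_ib_timeout from 10 to 20 gives each retry roughly 1000x longer to complete before the retry-exceeded error is reported.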




Ralph Castain wrote:
> The default retry values are wrong and will be corrected in the next
> OMPI release. For now, try running with:
>
> -mca btl_openib_ib_min_rnr_timer 25 -mca btl_openib_ib_timeout 20
>
> Should work.
> Ralph
>
> On Mar 26, 2009, at 2:16 PM, Gary Draving wrote:
>
>> Hi Everyone,
>>
>> I'm doing some performance testing using HPL with TCP turned off. My
>> HPL.dat file is included at the end of this message.
>> It seems to work well for lower Ns values, but as I increase that
>> value it inevitably fails with
>> "[[13535,1],169][btl_openib_component.c:2905:handle_wc] from
>> compute-0-0.local to: compute-0-8 error polling LP CQ with status
>> RETRY EXCEEDED ERROR status number 12 for wr_id 209907960 opcode 0
>> qp_idx 3"
>>
>> The machines in this test are all dual-core quads with built-in
>> Mellanox cards, for a total of 320 processors.
>>
>> It seems like once I reach a certain number of "Ns" the InfiniBand
>> starts having problems.  Increasing "btl_openib_ib_retry_count" and
>> "btl_openib_ib_timeout" to the max allowed us to get from 4096 to
>> 8192 Ns, but the error came back at around 8192.
>>
>> If anyone has some ideas on this problem I would be very interested.
>> Thanks
>>
>> (((((((((((((((((( HPL.dat file being used ))))))))))))))))))
>>
>> HPLinpack benchmark input file
>> Innovative Computing Laboratory, University of Tennessee
>> HPL.out      output file name (if any)
>> 6            device out (6=stdout,7=stderr,file)
>> 1            # of problems sizes (N)
>> 8192        Ns
>> 1            # of NBs
>> 256          NBs
>> 0            PMAP process mapping (0=Row-,1=Column-major)
>> 1            # of process grids (P x Q)
>> 19           Ps
>> 19           Qs
>> (defaults for rest)
>>
>> (((((((((((((((((( Full error printout ))))))))))))))))))
>>
>> [[13535,1],169][btl_openib_component.c:2905:handle_wc] from
>> compute-0-0.local to: compute-0-8 error polling LP CQ with status
>> RETRY EXCEEDED ERROR status number 12 for wr_id 209907960 opcode 0
>> qp_idx 3
>> --------------------------------------------------------------------------
>>
>> The InfiniBand retry count between two MPI processes has been
>> exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
>> (section 12.7.38):
>>
>>   The total number of times that the sender wishes the receiver to
>>   retry timeout, packet sequence, etc. errors before posting a
>>   completion error.
>>
>> This error typically means that there is something awry within the
>> InfiniBand fabric itself.  You should note the hosts on which this
>> error has occurred; it has been observed that rebooting or removing a
>> particular host from the job can sometimes resolve this issue.
>> Two MCA parameters can be used to control Open MPI's behavior with
>> respect to the retry count:
>>
>> * btl_openib_ib_retry_count - The number of times the sender will
>> attempt to retry (defaulted to 7, the maximum value).
>> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
>> to 10).  The actual timeout value used is calculated as:
>>
>>    4.096 microseconds * (2^btl_openib_ib_timeout)
>>
>> See the InfiniBand spec 1.2 (section 12.7.34) for more details.
>>
>> Below is some information about the host that raised the error and the
>> peer to which it was connected:
>>
>> Local host:   compute-0-0.local
>> Local device: mthca0
>> Peer host:    compute-0-8
>>
>> You may need to consult with your system administrator to get this
>> problem fixed.
>> --------------------------------------------------------------------------
>>
>> --------------------------------------------------------------------------
>>
>> mpirun has exited due to process rank 169 with PID 26725 on
>> node compute-0-0 exiting without calling "finalize". This may
>> have caused other processes in the application to be
>> terminated by signals sent by mpirun (as reported here).
>> --------------------------------------------------------------------------
>>

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Jeff Squyres
Cisco Systems
