Does anyone else see jobs dying with errors like this:

> The InfiniBand retry count between two MPI processes has been
> exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
> (section 12.7.38):

We started seeing this about a year ago.  It can happen whenever we make
changes to the IB fabric.  Multiple times now, just plugging line cards into
switches on a live system has caused large swaths of jobs to die with this
error.

Does anyone else have this problem?  Our fabric is Mellanox-based.

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(734)936-1985


