Does anyone have issues with jobs dying with errors:

> The InfiniBand retry count between two MPI processes has been
> exceeded. "Retry count" is defined in the InfiniBand spec 1.2
> (section 12.7.38):

We started seeing this about a year ago. It can happen whenever we make changes to the IB fabric. Multiple times now, just plugging line cards into switches on a live system has caused large swaths of jobs to die with this error. Does anyone else have this problem? Our fabric is Mellanox-based.

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(734)936-1985