[OMPI users] random IB failures when running medium core counts

2010-08-30 Thread Brock Palen
We recently installed a modest IB network on our cluster. When running a 1884-core IB HPL job, we get an IB error after a run. It does not always happen in the same place: some iterations pass, others fail. The error is below. We are using openmpi/1.4.2 with the intel 11

Re: [OMPI users] random IB failures when running medium core counts

2010-08-30 Thread Joshua Bernstein
Hello Brock, While it doesn't solve the problem, have you tried increasing the btl timeouts like the message suggests? With 1884 cores in use, perhaps there is some oversubscription in the fabric? -Joshua Bernstein, Penguin Computing Brock Palen wrote: We recently installed a modest IB
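
For readers weighing the timeout suggestion above, here is a minimal sketch of what raising the openib BTL timeout means in wall-clock terms. It assumes the relationship given in Open MPI's openib help text, where the local ACK timeout is 4.096 microseconds * 2^btl_openib_ib_timeout. The helper names, the example exponents, and the ./xhpl program path are illustrative assumptions and do not come from the thread.

```python
# Sketch only: assumes the openib BTL computes its local ACK timeout as
# 4.096 microseconds * 2**btl_openib_ib_timeout, as the openib help text states.

def ib_ack_timeout_seconds(exponent: int) -> float:
    """Return the ACK timeout in seconds for a given btl_openib_ib_timeout value."""
    if not 0 <= exponent <= 31:
        # The InfiniBand local ACK timeout is a 5-bit field.
        raise ValueError("exponent must be in the range 0..31")
    return 4.096e-6 * (2 ** exponent)


def mpirun_args(num_procs: int, exponent: int, program: str = "./xhpl") -> list:
    """Build an mpirun argument list that sets the timeout MCA parameter.

    The program path is a hypothetical placeholder, not taken from the thread.
    """
    return [
        "mpirun",
        "--mca", "btl_openib_ib_timeout", str(exponent),
        "-np", str(num_procs),
        program,
    ]


if __name__ == "__main__":
    # Compare a few exponents; each step of +1 doubles the timeout.
    for exponent in (10, 14, 20):
        print(exponent, ib_ack_timeout_seconds(exponent), "seconds")
    print(" ".join(mpirun_args(1884, 20)))
```

Because each increment doubles the timeout, a small change in the parameter is a large change in how long a stalled send waits before being reported. As the reply notes, this does not address the underlying fabric problem.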

[OMPI users] AUTO: Richard Treumann/Poughkeepsie/IBM is out of the office until 01/02/2001. (returning 09/07/2010)

2010-08-30 Thread Richard Treumann
I am out of the office until 09/07/2010. I will be out of the office on vacation the week before Labor Day. I will not see any email. Note: This is an automated response to your message "[OMPI users] random IB failures when running medium core counts" sent on 8/30/10 12:22:19. This is the

[OMPI users] compiler upgrades require openmpi rebuild?

2010-08-30 Thread David Turner
Hi, We have recently upgraded our default compiler suite from PGI 10.5 to PGI 10.8. We use the "module" system to manage third-party software. The module for PGI sets PATH and LD_LIBRARY_PATH. Using Open MPI 1.4.2, built with PGI 10.5, I have verified that changing PATH is sufficient for the