I recently had access to a 200+ node Magny Cours (24 ranks/host) 10G Linux
cluster.  I was able to run hello world, IMB, and HPCC with Open MPI v1.5.4,
but there were a couple of issues along the way.  After bumping up a few
system tunables on all of the nodes, a hello_world program ran just fine - it
appears that the TCP connections between most or all of the ranks are deferred
until they are actually used, so the easy test ran reasonably quickly.  I then
moved on to IMB.

I typically don't care about the small rank counts, so I add the -npmin 99999
option to run only the 'big' number of ranks.  This run ended with an abort
after MPI_Init(), but before running any tests.  Many (possibly all) of the
ranks emitted messages that looked like:

    [n112][[13200,1],1858][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 172.23.4.1 failed: Connection timed out (110)

Here n112 is one of the nodes in the job, and 172.23.4.1 is the first node in
the job.  One of the first things IMB does before running a test is create a
communicator for each specific rank count it is testing.  Apparently this
collective operation causes a large number of connections to be made.  The
abort messages (one example shown above) all show the connect failure going to
a single node, so it would appear that a very large number of nodes attempted
to connect to that one node at the same time and overwhelmed it.  (Or it was
slow and everyone ganged up on it as they worked their way around the ring.
:)  Is there a supported/suggested way to work around this?  It was very
repeatable.

I was able to work around this by providing my own definitions of MPI_Init()
and MPI_Init_thread() via the profiling interface: each wrapper calls the 'P'
version of the routine, and then has each rank send its rank number to the
rank one to the right, then two to the right, and so on around the ring.  I
added an MPI_Barrier(MPI_COMM_WORLD) call every N messages to keep things at a
controlled pace.  N was 64 by default, but settable via an environment
variable in case that number didn't work well for some reason.  This fully
connected the mesh (110k socket connections per host!) and allowed the tests
to run.  Not a great solution, I know, but I'll throw it out there until I
know the right way.
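
Roughly, the idea looks like this (a minimal sketch; the environment-variable
name and the exact send/receive pattern are illustrative, not necessarily
exactly what I ran):

    /* Sketch of the workaround: intercept MPI_Init()/MPI_Init_thread()
     * through the profiling interface, call the real PMPI_ routine, and
     * then walk around the ring establishing connections a batch at a
     * time, with a barrier every N steps to throttle the pace. */
    #include <mpi.h>
    #include <stdlib.h>

    static void preconnect_ring(void)
    {
        int rank, size, i;
        int n = 64;                                   /* default batch size */
        const char *env = getenv("PRECONNECT_BATCH"); /* illustrative name  */

        if (env != NULL && atoi(env) > 0)
            n = atoi(env);

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        for (i = 1; i < size; i++) {
            int peer = (rank + i) % size;             /* i ranks to the right */
            int from = (rank - i + size) % size;      /* i ranks to the left  */
            int sendbuf = rank, recvbuf;

            MPI_Sendrecv(&sendbuf, 1, MPI_INT, peer, 0,
                         &recvbuf, 1, MPI_INT, from, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            if (i % n == 0)
                MPI_Barrier(MPI_COMM_WORLD);          /* pace the connections */
        }
    }

    int MPI_Init(int *argc, char ***argv)
    {
        int ret = PMPI_Init(argc, argv);
        if (ret == MPI_SUCCESS)
            preconnect_ring();
        return ret;
    }

    int MPI_Init_thread(int *argc, char ***argv, int required, int *provided)
    {
        int ret = PMPI_Init_thread(argc, argv, required, provided);
        if (ret == MPI_SUCCESS)
            preconnect_ring();
        return ret;
    }

Using MPI_Sendrecv keeps each step deadlock-free, and the barrier every N
steps limits how many new connections hit any one node at once.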

Once I had this in place, I used the same workaround with HPCC as well.
Without it, HPCC would not get very far at all; with it, I was able to make it
through the entire test.

Looking forward to getting the experts' thoughts on the best way to handle
big TCP clusters - thanks!

Brent

P.S.  v1.5.4 worked *much* better than v1.4.3 on this cluster - not sure why,
but kudos to those working on changes since then!
