maybe it's related to https://svn.open-mpi.org/trac/ompi/ticket/1378 ??
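
In the meantime, a stripped-down reproducer along the lines of the sketch below (illustrative only -- not Justin's actual code; the message count, size, and pairing are arbitrary) is a cheap way to test the eager/rendezvous theory. With small messages the sends typically complete out of the eager buffers; push MSG_SIZE past the btl eager limits (or force the limits to 0 as discussed in the thread below) and both ranks in a pair block in MPI_Send waiting for receives the partner hasn't posted yet. Strictly speaking the pattern relies on buffering the MPI standard doesn't guarantee, which is exactly why it only bites above the eager limit.

/* send_first.c -- illustrative sketch only, not Justin's code.
   Each rank sends NMSG messages to its partner before posting any
   receives.  Under the eager limit the sends return after being
   buffered; above it (rendezvous) both ranks block in MPI_Send. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NMSG      64      /* messages queued before any receive   */
#define MSG_SIZE  1024    /* bytes; also try > btl_*_eager_limit   */

int main(int argc, char **argv)
{
    int rank, size, partner, i;
    char *sbuf, *rbuf;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size % 2) {
        if (rank == 0) fprintf(stderr, "needs an even number of ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    partner = rank ^ 1;               /* pair ranks 0-1, 2-3, ...   */
    sbuf = malloc(MSG_SIZE);
    rbuf = malloc(MSG_SIZE);

    /* Post all sends first -- relies entirely on eager buffering.  */
    for (i = 0; i < NMSG; i++)
        MPI_Send(sbuf, MSG_SIZE, MPI_CHAR, partner, i, MPI_COMM_WORLD);

    /* Receives are only posted after every send has returned.      */
    for (i = 0; i < NMSG; i++)
        MPI_Recv(rbuf, MSG_SIZE, MPI_CHAR, partner, i, MPI_COMM_WORLD, &st);

    if (rank == 0) printf("completed (messages fit in eager buffers)\n");
    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}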
On 12/5/08, Justin <luitj...@cs.utah.edu> wrote:
>
> The reason I'd like to disable these eager buffers is to help detect the
> deadlock better. I would not run with this for a normal run, but it would
> be useful for debugging. If the deadlock is indeed due to our code, then
> disabling any shared buffers or eager sends would make that deadlock
> reproducible. In addition, we might be able to lower the number of
> processors. Right now, determining which processor is deadlocked when we
> are using 8K cores and each processor has hundreds of outstanding messages
> would be quite difficult.
>
> Thanks for your suggestions,
> Justin
>
> Brock Palen wrote:
>
>> OpenMPI has different eager limits for all the network types; on your
>> system run:
>>
>> ompi_info --param btl all
>>
>> and look for the eager_limit parameters.
>> You can set these values to 0 using the syntax I showed you before. That
>> would disable eager messages.
>> There might be a better way to disable eager messages.
>> Not sure why you would want to disable them; they are there for
>> performance.
>>
>> Maybe you would still see a deadlock even if every message was below the
>> threshold. I think there is a limit on the number of eager messages a
>> receiving CPU will accept, but I'm not sure about that, and I still kind
>> of doubt it.
>>
>> Try tweaking your buffer sizes: make the openib btl eager limit the same
>> as the shared memory one, and see if you get lockups between hosts and
>> not just within shared memory.
>>
>> Brock Palen
>> www.umich.edu/~brockp
>> Center for Advanced Computing
>> bro...@umich.edu
>> (734)936-1985
>>
>> On Dec 5, 2008, at 2:10 PM, Justin wrote:
>>
>>> Thank you for this info. I should add that our code tends to post a lot
>>> of sends prior to the other side posting receives. This causes a lot of
>>> unexpected messages to exist. Our code explicitly matches up all tags
>>> and processors (that is, we do not use MPI wildcards). If we had a
>>> deadlock, I would think we would see it regardless of whether or not we
>>> cross the rendezvous threshold. I guess one way to test this would be
>>> to set this threshold to 0. If it then deadlocks, we would likely be
>>> able to track down the deadlock. Are there any other parameters we can
>>> pass MPI that will turn off buffering?
>>>
>>> Thanks,
>>> Justin
>>>
>>> Brock Palen wrote:
>>>
>>>> Whenever this happens we have found the code to have a deadlock; users
>>>> never saw it until they crossed the eager->rendezvous threshold.
>>>>
>>>> Yes, you can disable shared memory with:
>>>>
>>>> mpirun --mca btl ^sm
>>>>
>>>> Or you can try increasing the eager limit.
>>>>
>>>> ompi_info --param btl sm
>>>>
>>>> MCA btl: parameter "btl_sm_eager_limit" (current value: "4096")
>>>>
>>>> You can modify this limit at run time. I think (can't test it right
>>>> now) it is just:
>>>>
>>>> mpirun --mca btl_sm_eager_limit 40960
>>>>
>>>> When tweaking these values, I think you can also use environment
>>>> variables in place of putting it all on the mpirun line:
>>>>
>>>> export OMPI_MCA_btl_sm_eager_limit=40960
>>>>
>>>> See:
>>>> http://www.open-mpi.org/faq/?category=tuning
>>>>
>>>> Brock Palen
>>>> www.umich.edu/~brockp
>>>> Center for Advanced Computing
>>>> bro...@umich.edu
>>>> (734)936-1985
>>>>
>>>> On Dec 5, 2008, at 12:22 PM, Justin wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We are currently using OpenMPI 1.3 on Ranger for large processor jobs
>>>>> (8K+). Our code appears to deadlock occasionally, at random, within
>>>>> point-to-point communication (see stack trace below). This code has
>>>>> been tested on many different MPI versions and, as far as we know, it
>>>>> does not contain a deadlock. However, in the past we have run into
>>>>> problems with shared memory optimizations within MPI causing
>>>>> deadlocks. We can usually avoid these by setting a few environment
>>>>> variables to either increase the size of the shared memory buffers or
>>>>> disable shared memory optimizations altogether. Does OpenMPI have any
>>>>> known deadlocks that might be causing our deadlocks? If so, are there
>>>>> any workarounds? Also, how do we disable shared memory within OpenMPI?
>>>>>
>>>>> Here is an example of where processors are hanging:
>>>>>
>>>>> #0 0x00002b2df3522683 in mca_btl_sm_component_progress () from
>>>>> /opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_btl_sm.so
>>>>> #1 0x00002b2df2cb46bf in mca_bml_r2_progress () from
>>>>> /opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_bml_r2.so
>>>>> #2 0x00002b2df0032ea4 in opal_progress () from
>>>>> /opt/apps/intel10_1/openmpi/1.3/lib/libopen-pal.so.0
>>>>> #3 0x00002b2ded0d7622 in ompi_request_default_wait_some () from
>>>>> /opt/apps/intel10_1/openmpi/1.3//lib/libmpi.so.0
>>>>> #4 0x00002b2ded109e34 in PMPI_Waitsome () from
>>>>> /opt/apps/intel10_1/openmpi/1.3//lib/libmpi.so.0
>>>>>
>>>>> Thanks,
>>>>> Justin