Gus -- Can you try v1.4.2, which was just released today?

On May 4, 2010, at 4:18 PM, Gus Correa wrote:
> Hi Ralph
>
> Thank you very much.
> The "-mca btl ^sm" workaround seems to have solved the problem,
> at least for the little hello_c.c test.
> I just ran it fine up to 128 processes.
>
> I confess I am puzzled by this workaround.
> * Why should we turn off "sm" on a standalone machine,
> where everything is supposed to operate via shared memory?
> * Do I incur a performance penalty by not using "sm"?
> * What other mechanism does Open MPI actually use for process
> communication in this case?
>
> It seems to be using tcp, because when I try -np 256 I get this error:
>
> [spinoza:02715] [[11518,0],0] ORTE_ERROR_LOG: The system limit on number
> of network connections a process can open was reached in file
> ../../../../../orte/mca/oob/tcp/oob_tcp.c at line 447
> --------------------------------------------------------------------------
> Error: system limit exceeded on number of network connections that can
> be open
> This can be resolved by setting the mca parameter
> opal_set_max_sys_limits to 1,
> increasing your limit descriptor setting (using limit or ulimit commands),
> or asking the system administrator to increase the system limit.
> --------------------------------------------------------------------------
>
> Anyway, no big deal, because we don't intend to oversubscribe the
> processors on real jobs anyway (and the error message itself suggests a
> workaround to increase np, if needed).
>
> Many thanks,
> Gus Correa
>
> Ralph Castain wrote:
> > I would certainly try -mca btl ^sm and see if that solves the problem.
> >
> > On May 4, 2010, at 2:38 PM, Eugene Loh wrote:
> >
> >> Gus Correa wrote:
> >>
> >>> Dear Open MPI experts,
> >>>
> >>> I need your help to get Open MPI right on a standalone
> >>> machine with Nehalem processors.
> >>>
> >>> How to tweak the mca parameters to avoid problems
> >>> with Nehalem (and perhaps AMD processors also),
> >>> where MPI programs hang, was discussed here before.
> >>>
> >>> However, I lost track of the details of how to work around the
> >>> problem, and of whether it has perhaps already been fully fixed.
> >> Yes, perhaps the problem you're seeing is not what you remember being
> >> discussed.
> >>
> >> Perhaps you're thinking of https://svn.open-mpi.org/trac/ompi/ticket/2043 .
> >> It's presumably fixed.
> >>
> >>> I am now facing the problem directly on a single Nehalem box.
> >>>
> >>> I installed Open MPI 1.4.1 from source,
> >>> and compiled the test hello_c.c with mpicc.
> >>> Then I tried to run it with:
> >>>
> >>> 1) mpirun -np 4 a.out
> >>> It ran OK (but seemed to be slow).
> >>>
> >>> 2) mpirun -np 16 a.out
> >>> It hung, and brought the machine to a halt.
> >>>
> >>> Any words of wisdom are appreciated.
> >>>
> >>> More info:
> >>>
> >>> * Open MPI 1.4.1 installed from source (tarball from your site).
> >>> * Compilers are gcc/g++/gfortran 4.4.3-4.
> >>> * OS is Fedora Core 12.
> >>> * The machine is a Dell box with Intel Xeon 5540 (quad-core)
> >>> processors on a two-way motherboard and 48GB of RAM.
> >>> * /proc/cpuinfo indicates that hyperthreading is turned on.
> >>> (I can see 16 "processors".)
> >>>
> >>> **
> >>>
> >>> What should I do?
> >>>
> >>> Use -mca btl ^sm ?
> >>> Use -mca btl_sm_num_fifos=some_number ? (Which number?)
> >>> Use both?
> >>> Do something else?
> >> _______________________________________________
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
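[Editor's note: the alternatives discussed in the thread can be summarized as shell command lines. This is a sketch only; the process counts and the fifo value are illustrative placeholders, not recommendations made in the thread itself.]

```shell
# 1) Disable the shared-memory BTL, as Ralph suggests; Open MPI then
#    falls back to other BTLs (e.g. tcp/self) even on a single node:
#      mpirun -mca btl ^sm -np 16 ./a.out
#
# 2) Keep sm but increase the number of FIFOs, the other option Gus
#    asks about (8 is an illustrative value):
#      mpirun -mca btl_sm_num_fifos 8 -np 16 ./a.out
#
# 3) For the connection-limit error at -np 256, follow the error text:
#    let Open MPI raise system limits at launch,
#      mpirun -mca opal_set_max_sys_limits 1 -mca btl ^sm -np 256 ./a.out
#    or raise the per-process file-descriptor limit yourself first,
#    since each tcp connection consumes one descriptor.

# Check the current per-process open-file-descriptor limit:
ulimit -n
```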