Try running with:

  mpirun.openmpi-1.4.1 --mca btl_base_verbose 50 --mca btl self,openib -n 2 \
      --mca btl_openib_verbose 100 ./IMB-MPI1 -npmin 2 PingPong
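While you're at it, a few quick checks on the compute nodes may help narrow things down (this is just a sketch; it assumes ompi_info and the standard OFED verbs utilities are installed there):

  # confirm the openib BTL is actually built into this Open MPI install
  ompi_info | grep openib

  # confirm the verbs stack can see the HCA and an active port
  ibv_devinfo

  # check the locked-memory limit the MPI processes will inherit; a low
  # value here is a common reason the openib BTL fails to initialize
  ulimit -l

Since Torque launches the job, it's worth running these from inside the job script so you see the environment and limits that the MPI processes actually get, not the ones from an interactive login shell.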
Also, are you saying that running the same command line with osu_latency works just fine? That would be really weird...

On May 18, 2010, at 6:18 AM, Peter Kruse wrote:

> Hello,
>
> Trying to run the Intel MPI Benchmarks with Open MPI 1.4.1 fails while
> initializing the openib component. The system is Debian GNU/Linux 5.0.4.
> The command used to start the job (under Torque 2.4.7) was:
>
>   mpirun.openmpi-1.4.1 --mca btl_base_verbose 50 --mca btl self,openib -n 2 \
>       ./IMB-MPI1 -npmin 2 PingPong
>
> and it results in these messages:
>
> ----------------------------8<----------------------------------------------
>
> [beo-15:20933] mca: base: components_open: Looking for btl components
> [beo-16:20605] mca: base: components_open: Looking for btl components
> [beo-15:20933] mca: base: components_open: opening btl components
> [beo-15:20933] mca: base: components_open: found loaded component openib
> [beo-15:20933] mca: base: components_open: component openib has no register function
> [beo-15:20933] mca: base: components_open: component openib open function successful
> [beo-15:20933] mca: base: components_open: found loaded component self
> [beo-15:20933] mca: base: components_open: component self has no register function
> [beo-15:20933] mca: base: components_open: component self open function successful
> [beo-16:20605] mca: base: components_open: opening btl components
> [beo-16:20605] mca: base: components_open: found loaded component openib
> [beo-16:20605] mca: base: components_open: component openib has no register function
> [beo-16:20605] mca: base: components_open: component openib open function successful
> [beo-16:20605] mca: base: components_open: found loaded component self
> [beo-16:20605] mca: base: components_open: component self has no register function
> [beo-16:20605] mca: base: components_open: component self open function successful
> [beo-15:20933] select: initializing btl component openib
> [beo-15:20933] select: init of component openib returned failure
> [beo-15:20933] select: module openib unloaded
> [beo-15:20933] select: initializing btl component self
> [beo-15:20933] select: init of component self returned success
> [beo-16:20605] select: initializing btl component openib
> [beo-16:20605] select: init of component openib returned failure
> [beo-16:20605] select: module openib unloaded
> [beo-16:20605] select: initializing btl component self
> [beo-16:20605] select: init of component self returned success
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications. This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
>   Process 1 ([[4887,1],0]) is on host: beo-15
>   Process 2 ([[4887,1],1]) is on host: beo-16
>   BTLs attempted: self
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   PML add procs failed
>   --> Returned "Unreachable" (-12) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init_thread
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [beo-15:20933] Abort before MPI_INIT completed successfully; not able to
> guarantee that all other processes were killed!
> --------------------------------------------------------------------------
> orterun has exited due to process rank 0 with PID 20933 on
> node beo-15 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by orterun (as reported here).
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init_thread
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [beo-16:20605] Abort before MPI_INIT completed successfully; not able to
> guarantee that all other processes were killed!
> [beo-15:20930] 1 more process has sent help message help-mca-bml-r2.txt /
> unreachable proc
> [beo-15:20930] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
> help / error messages
> [beo-15:20930] 1 more process has sent help message help-mpi-runtime /
> mpi_init:startup:internal-failure
>
> ----------------------------8<----------------------------------------------
>
> Running another benchmark (OSU) succeeds in loading the openib component.
>
> "ibstat | grep -i state" on both nodes gives:
>
> ----------------------------8<----------------------------------------------
> State: Active
> Physical state: LinkUp
> ----------------------------8<----------------------------------------------
>
> Running with "mpi_abort_delay -1" and attaching strace to the process
> is not very helpful; it loops with:
>
> ----------------------------8<----------------------------------------------
> rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
> rt_sigaction(SIGCHLD, NULL, {0x2aee58ff3250, [CHLD], SA_RESTORER|SA_RESTART,
> 0x2aee59d44f60}, 8) = 0
> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> nanosleep({5, 0}, {5, 0}) = 0
> ----------------------------8<----------------------------------------------
>
> Does anybody have an idea what is wrong, or how we can get more debugging
> information about the initialization of the openib module?
>
> Thanks for any help,
>
> Peter
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/