Don't include udapl - that code may well be stale

Sent from my iPhone
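For reference, the uDAPL BTL can also be skipped at run time without rebuilding the RPM. A sketch (this assumes the openib BTL is the transport you actually want; the `^` prefix is Open MPI's component-exclusion syntax):

```shell
# One-off: exclude the udapl BTL for this run only
mpirun --mca btl ^udapl -n 1 hello

# Persistent alternative: put the same exclusion in
# $HOME/.openmpi/mca-params.conf (one parameter per line):
#   btl = ^udapl
```

Rebuilding without --with-udapl (as suggested above) removes the component entirely; the MCA parameter just stops the existing build from trying to open it.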
On Jun 23, 2013, at 3:42 AM, dani <d...@letai.org.il> wrote:

> Hi,
>
> I've encountered strange issues when trying to run a simple mpi job on a
> single host which has IB.
> The complete errors:
>
>> -> mpirun -n 1 hello
>> --------------------------------------------------------------------------
>> WARNING: Failed to open "ofa-v2-mlx4_0-1"
>> [DAT_PROVIDER_NOT_FOUND:DAT_NAME_NOT_REGISTERED].
>> This may be a real error or it may be an invalid entry in the uDAPL
>> Registry which is contained in the dat.conf file. Contact your local
>> System Administrator to confirm the availability of the interfaces in
>> the dat.conf file.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> [[53031,1],0]: A high-performance Open MPI point-to-point messaging module
>> was unable to find any relevant network interfaces:
>>
>> Module: uDAPL
>> Host: n01
>>
>> Another transport will be used instead, although this may result in
>> lower performance.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> WARNING: It appears that your OpenFabrics subsystem is configured to only
>> allow registering part of your physical memory. This can cause MPI jobs to
>> run with erratic performance, hang, and/or crash.
>>
>> This may be caused by your OpenFabrics vendor limiting the amount of
>> physical memory that can be registered. You should investigate the
>> relevant Linux kernel module parameters that control how much physical
>> memory can be registered, and increase them to allow registering all
>> physical memory on your machine.
>>
>> See this Open MPI FAQ item for more information on these Linux kernel module
>> parameters:
>>
>> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>>
>> Local host:          n01
>> Registerable memory: 32768 MiB
>> Total memory:        65503 MiB
>>
>> Your MPI job will continue, but may behave poorly and/or hang.
>> --------------------------------------------------------------------------
>> Process 0 on n01 out of 1
>> [n01:13534] 7 more processes have sent help message help-mpi-btl-udapl.txt /
>> dat_ia_open fail
>> [n01:13534] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
>> help / error messages
>
> Following is my setup and other info:
> OS: CentOS 6.3 x86_64
> installed ofed 3.5 from source (./install.pl --all)
> installed openmpi 1.6.4 with the following build parameters:
>
>> rpmbuild --rebuild openmpi-1.6.4-1.src.rpm --define '_prefix
>> /opt/openmpi/1.6.4/gcc' --define '_defaultdocdir /opt/openmpi/1.6.4/gcc'
>> --define '_mandir %{_prefix}/share/man' --define '_datadir %{_prefix}/share'
>> --define 'configure_options --with-openib=/usr
>> --with-openib-libdir=/usr/lib64 CC=gcc CXX=g++ F77=gfortran FC=gfortran
>> --enable-mpirun-prefix-by-default --target=x86_64-unknown-linux-gnu
>> --with-hwloc=/usr/local --with-libltdl --enable-branch-probabilities
>> --with-udapl --with-sge --disable-vt' --define 'use_default_rpm_opt_flags 1'
>> --define '_name openmpi-1.6.4_gcc' --define 'install_shell_scripts 1'
>> --define 'shell_scripts_basename mpivars' --define '_usr /usr' --define
>> 'ofed 0' 2>&1 | tee openmpi.build.sge
>
> (--disable-vt was used due to cuda presence, which is automatically linked
> by vt and becomes a dependency with no matching rpm).
>
> memorylocked is unlimited:
>
>> ->ulimit -a
>> core file size          (blocks, -c) 0
>> data seg size           (kbytes, -d) unlimited
>> scheduling priority             (-e) 0
>> file size               (blocks, -f) unlimited
>> pending signals                 (-i) 515028
>> max locked memory       (kbytes, -l) unlimited
>> max memory size         (kbytes, -m) unlimited
>> open files                      (-n) 1024
>> pipe size            (512 bytes, -p) 8
>> POSIX message queues     (bytes, -q) 819200
>> real-time priority              (-r) 0
>> stack size              (kbytes, -s) 10240
>> cpu time               (seconds, -t) unlimited
>> max user processes              (-u) 1024
>> virtual memory          (kbytes, -v) unlimited
>> file locks                      (-x) unlimited
>
> IB devices are present:
>
>> ->ibv_devinfo
>> hca_id: mlx4_0
>>     transport:          InfiniBand (0)
>>     fw_ver:             2.9.1000
>>     node_guid:          0002:c903:004d:b0e2
>>     sys_image_guid:     0002:c903:004d:b0e5
>>     vendor_id:          0x02c9
>>     vendor_part_id:     26428
>>     hw_ver:             0xB0
>>     board_id:           MT_0D90110009
>>     phys_port_cnt:      1
>>     port:   1
>>         state:          PORT_ACTIVE (4)
>>         max_mtu:        4096 (5)
>>         active_mtu:     4096 (5)
>>         sm_lid:         2
>>         port_lid:       53
>>         port_lmc:       0x00
>>         link_layer:     InfiniBand
>
> the hello program source:
>
>> ->cat hello.c
>> #include <stdio.h>
>> #include <mpi.h>
>>
>> int main(int argc, char *argv[]) {
>>     int numprocs, rank, namelen;
>>     char processor_name[MPI_MAX_PROCESSOR_NAME];
>>
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>     MPI_Get_processor_name(processor_name, &namelen);
>>
>>     printf("Process %d on %s out of %d\n", rank, processor_name, numprocs);
>>
>>     MPI_Finalize();
>>     return 0;
>> }
>
> simply compiled as:
>
>> mpicc hello.c -o hello
>
> the IB modules seem to be present:
>
>> ->service openibd status
>>
>> HCA driver loaded
>>
>> Configured IPoIB devices:
>> ib0
>>
>> Currently active IPoIB devices:
>> ib0
>>
>> The following OFED modules are loaded:
>>
>> rdma_ucm
>> rdma_cm
>> ib_addr
>> ib_ipoib
>> mlx4_core
>> mlx4_ib
>> mlx4_en
>> ib_mthca
>> ib_uverbs
>> ib_umad
>> ib_sa
>> ib_cm
>> ib_mad
>> ib_core
>> iw_cxgb3
>> iw_cxgb4
>> iw_nes
>> ib_qib
>
> Can anyone help?
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
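Regarding the "Registerable memory: 32768 MiB" warning in the quoted output: on mlx4 hardware the registerable limit is 2^log_num_mtt * 2^log_mtts_per_seg * page_size. A sketch of the arithmetic (the parameter values below are the common OFED-era defaults, assumed here rather than read from this particular system):

```shell
# Registerable memory = 2^log_num_mtt * 2^log_mtts_per_seg * page_size.
# Assuming log_num_mtt=20, log_mtts_per_seg=3, and 4096-byte pages,
# this reproduces the 32768 MiB figure from the warning:
echo $(( (1 << 20) * (1 << 3) * 4096 / 1024 / 1024 ))
```

If those defaults apply, raising log_num_mtt (e.g. `options mlx4_core log_num_mtt=24` in a file under /etc/modprobe.d/, followed by a driver reload) would lift the limit above the 65503 MiB of physical memory; the Open MPI FAQ item linked in the warning walks through choosing a value.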