[OMPI users] Intermittent failure when launch application linked with OpenMPI 3.1.1
Hi,

When launching an application linked with OpenMPI 3.1.1 using the line:

srun --mpi=pmi2 --distribution=arbitrary --cpu_bind=map_cpu:0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126 -n 1024 a.out

I often (most of the time) get:

[amd-0013][[29472,1],727][connect/btl_openib_connect_udcm.c:1531:udcm_find_endpoint] could not find endpoint with port: 1, lid: 21, msg_type: 100
[amd-0013][[29472,1],727][connect/btl_openib_connect_udcm.c:2036:udcm_process_messages] could not find associated endpoint.
--
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[29472,1],727]) is on host: amd-0013
  Process 2 ([[29472,1],711]) is on host: unknown!
  BTLs attempted: self openib

Your MPI job is now going to abort; sorry.
--
[amd-0013:16718] *** An error occurred in MPI_Allreduce
[amd-0013:16718] *** reported by process [1931476993,727]
[amd-0013:16718] *** on communicator MPI_COMM_WORLD
[amd-0013:16718] *** MPI_ERR_INTERN: internal error
[amd-0013:16718] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,

This failure is intermittent; sometimes the job runs with no problem. I have tried setting these environment variables:

export OMPI_MCA_btl_openib_connect_udcm_max_retry=500
export OMPI_MCA_btl_openib_connect_udcm_timeout=500

but it is unclear whether they help.

Does anyone understand what is happening and how I can prevent it?

Many thanks,
Dave

--
CCFD
David Whitaker, Ph.D.        whita...@cray.com
Aerospace CFD Specialist     phone: (651)605-9078
ISV Applications/Cray Inc    fax: (651)605-9001
CCFD

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
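The map_cpu list in the srun line above is every even CPU ID from 0 to 126 (64 entries). A minimal sketch of generating that list instead of typing it, assuming GNU seq is available; cpu_map is a hypothetical helper name, not something from the original post:

```shell
#!/bin/bash
# cpu_map FIRST STEP LAST: print a comma-separated CPU ID list.
# Assumes GNU seq, whose -s option sets the separator between numbers.
cpu_map() {
    seq -s, "$1" "$2" "$3"
}

# Every even CPU from 0 to 126, as in the srun line quoted above.
cpu_map 0 2 126
```

The result could then be substituted into the launch line, for example --cpu_bind=map_cpu:"$(cpu_map 0 2 126)", which keeps the binding readable when the core count changes.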
[OMPI users] Memory Leak in 3.1.2 + UCX
We are seeing a gaping memory leak when running OpenMPI 3.1.x (or 2.1.2, for that matter) built with UCX support. The leak shows up whether the "ucx" PML is specified for the run or not. The applications in question are arepo and gizmo, but I have no reason to believe that others are not affected as well.

Basically the MPI processes grow without bound until SLURM kills the job or the host memory is exhausted. If I configure and build with "--without-ucx" the problem goes away.

I didn't see anything about this on the UCX github site so I thought I'd ask here. Anyone else seeing the same or similar?

What version of UCX is OpenMPI 3.1.x tested against?

Regards,

Charlie Taylor
UF Research Computing

Details:
--
RHEL7.5
OpenMPI 3.1.2 (and any other version I've tried)
ucx 1.2.2-1.el7 (RH native)
RH native IB stack
Mellanox FDR/EDR IB fabric
Intel Parallel Studio 2018.1.163

Configuration Options:
--
CFG_OPTS=""
CFG_OPTS="$CFG_OPTS C=icc CXX=icpc FC=ifort FFLAGS=\"-O2 -g -warn -m64\" LDFLAGS=\"\" "
CFG_OPTS="$CFG_OPTS --enable-static"
CFG_OPTS="$CFG_OPTS --enable-orterun-prefix-by-default"
CFG_OPTS="$CFG_OPTS --with-slurm=/opt/slurm"
CFG_OPTS="$CFG_OPTS --with-pmix=/opt/pmix/2.1.1"
CFG_OPTS="$CFG_OPTS --with-pmi=/opt/slurm"
CFG_OPTS="$CFG_OPTS --with-libevent=external"
CFG_OPTS="$CFG_OPTS --with-hwloc=external"
CFG_OPTS="$CFG_OPTS --with-verbs=/usr"
CFG_OPTS="$CFG_OPTS --with-libfabric=/usr"
CFG_OPTS="$CFG_OPTS --with-ucx=/usr"
CFG_OPTS="$CFG_OPTS --with-verbs-libdir=/usr/lib64"
CFG_OPTS="$CFG_OPTS --with-mxm=no"
CFG_OPTS="$CFG_OPTS --with-cuda=${HPC_CUDA_DIR}"
CFG_OPTS="$CFG_OPTS --enable-openib-udcm"
CFG_OPTS="$CFG_OPTS --enable-openib-rdmacm"
CFG_OPTS="$CFG_OPTS --disable-pmix-dstore"

rpmbuild --ba \
    --define '_name openmpi' \
    --define "_version $OMPI_VER" \
    --define "_release ${RELEASE}" \
    --define "_prefix $PREFIX" \
    --define '_mandir %{_prefix}/share/man' \
    --define '_defaultdocdir %{_prefix}' \
    --define 'mflags -j 8' \
    --define 'use_default_rpm_opt_flags 1' \
    --define 'use_check_files 0' \
    --define 'install_shell_scripts 1' \
    --define 'shell_scripts_basename mpivars' \
    --define "configure_options $CFG_OPTS " \
    openmpi-${OMPI_VER}.spec 2>&1 | tee rpmbuild.log
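For anyone trying to reproduce the unbounded growth described above, one way to put numbers on it is to sample the resident set size of a rank over time. This is only a sketch, assuming a Linux node with /proc mounted; rss_kb and log_rss are hypothetical helper names and the default interval and count are arbitrary:

```shell
#!/bin/bash
# rss_kb PID: print the resident set size (VmRSS) of a process in kB,
# as reported by the Linux kernel in /proc/PID/status.
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# log_rss PID [INTERVAL_SECONDS] [COUNT]: print "epoch_seconds rss_kb"
# COUNT times, sleeping INTERVAL_SECONDS between samples.
log_rss() {
    local pid=$1 interval=${2:-10} count=${3:-6} i
    for ((i = 0; i < count; i++)); do
        [ -r "/proc/$pid/status" ] || break   # stop once the process has exited
        printf '%s %s\n' "$(date +%s)" "$(rss_kb "$pid")"
        sleep "$interval"
    done
}
```

Running log_rss against the PID of one MPI rank, once for a build with UCX and once for a build configured with --without-ucx, would give two series that can be compared directly.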
Re: [OMPI users] Cannot run MPI code on multiple cores with PBS
Note that what Gilles said is correct: it's not just the dependent libraries of libmpi.so (and friends) that matter -- it's also the dependent libraries of all of Open MPI's plugins that matter.

You can run "ldd *.so" in the lib directory where you installed Open MPI, but you'll also need to "ldd *.so" in the lib/openmpi directory -- that's where Open MPI installs its plugins.

I suspect that if you run "ldd lib/openmpi/mca_plm_tm.so" on the head node, you'll see all the dependent libraries listed. But if you run the same command on your back-end compute nodes, it might say "not found" for some of the libraries.

> On Oct 4, 2018, at 9:12 AM, John Hearns via users wrote:
>
> Michele, the command is ldd ./code.o
> I just Googled - ldd means List dynamic Dependencies
>
> To find out the PBS batch system type - that is a good question!
> Try this: qstat --version
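The plugin check described in this message can be scripted so that only the problems are printed. A minimal sketch, assuming bash and ldd on the node being checked; check_plugins is a hypothetical helper name, and the directories passed to it are whatever your own installation uses:

```shell
#!/bin/bash
# check_plugins DIR...: run ldd on every .so file in each directory and
# print only the files that have unresolved ("not found") dependencies.
check_plugins() {
    local dir so missing
    for dir in "$@"; do
        for so in "$dir"/*.so; do
            [ -e "$so" ] || continue   # the glob matched nothing in this directory
            missing=$(ldd "$so" 2>/dev/null | grep 'not found' || true)
            if [ -n "$missing" ]; then
                printf '%s:\n%s\n' "$so" "$missing"
            fi
        done
    done
}
```

Called with the lib and lib/openmpi directories of the Open MPI installation, once on the head node and once on a compute node, empty output means every dependency resolved on that node.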
Re: [OMPI users] Cannot run MPI code on multiple cores with PBS
Michele, the command is ldd ./code.o
I just Googled - ldd means List dynamic Dependencies

To find out the PBS batch system type - that is a good question!
Try this: qstat --version

On Thu, 4 Oct 2018 at 10:12, Castellana Michele wrote:
>
> Dear John,
> Thank you for your reply. I have tried
>
> ldd mpirun ./code.o
>
> but I get an error message; I do not know the proper syntax for the
> ldd command. Here is the information about the Linux version
>
> $ cat /etc/os-release
> NAME="CentOS Linux"
> VERSION="7 (Core)"
> ID="centos"
> ID_LIKE="rhel fedora"
> VERSION_ID="7"
> PRETTY_NAME="CentOS Linux 7 (Core)"
> ANSI_COLOR="0;31"
> CPE_NAME="cpe:/o:centos:centos:7"
> HOME_URL="https://www.centos.org/"
> BUG_REPORT_URL="https://bugs.centos.org/"
>
> CENTOS_MANTISBT_PROJECT="CentOS-7"
> CENTOS_MANTISBT_PROJECT_VERSION="7"
> REDHAT_SUPPORT_PRODUCT="centos"
> REDHAT_SUPPORT_PRODUCT_VERSION="7"
>
> Could you please tell me how to check whether the batch system is PBSPro or
> OpenPBS?
>
> Best,
Re: [OMPI users] Cannot run MPI code on multiple cores with PBS
Dear John,
Thank you for your reply. I have tried

ldd mpirun ./code.o

but I get an error message; I do not know the proper syntax for the ldd command. Here is the information about the Linux version

$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

Could you please tell me how to check whether the batch system is PBSPro or OpenPBS?

Best,

On Oct 4, 2018, at 10:30 AM, John Hearns via users wrote:

Michele one tip: log into a compute node using ssh and as your own username.
If you use the Modules environment then load the modules you use in the job script,
then use the ldd utility to check if you can load all the libraries in the code.o executable.

Actually you are better to submit a short batch job which does not use mpirun but uses ldd.
A proper batch job will duplicate the environment you wish to run in.

ldd ./code.o

By the way, is the batch system PBSPro or OpenPBS? Version 6 seems a bit old.
Can you say what version of Redhat or CentOS this cluster is installed with?
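The suggestion quoted in this message, a short batch job that runs ldd instead of mpirun, might look like the sketch below. It reuses the queue name and executable name from this thread; the job name lddcheck, the EXE override, and the list_unresolved helper are hypothetical, and the #PBS resource lines are an assumption that may need adjusting for a given site:

```shell
#!/bin/bash
#PBS -l walltime=00:01:00
#PBS -l nodes=1:ppn=1
#PBS -q batch
#PBS -N lddcheck

# PBS sets PBS_O_WORKDIR to the directory the job was submitted from;
# fall back to the current directory when run outside PBS.
cd "${PBS_O_WORKDIR:-.}" || exit 1

# list_unresolved BINARY: print only the dependencies that ldd cannot find.
list_unresolved() {
    ldd "$1" 2>&1 | grep 'not found'
}

# ldd takes the binary itself as its argument, not an mpirun command line.
exe="${EXE:-./code.o}"
if [ -e "$exe" ]; then
    list_unresolved "$exe" || echo "all libraries resolved for $exe"
else
    echo "no such file: $exe" >&2
fi
```

Submitted with qsub, the job's output file then shows what the dynamic loader sees on the compute node, in the same environment the real job would run in.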
Re: [OMPI users] Cannot run MPI code on multiple cores with PBS
In this case, some Open MPI plugins are missing some third party libraries,
so you would have to ldd all the plugins (i.e. the .so files) located
in /lib/openmpi
in order to reveal any issue.

Cheers,

Gilles

On Thu, Oct 4, 2018 at 4:34 PM John Hearns via users wrote:
>
> Michele one tip: log into a compute node using ssh and as your own
> username.
> If you use the Modules environment then load the modules you use in
> the job script,
> then use the ldd utility to check if you can load all the libraries
> in the code.o executable.
>
> Actually you are better to submit a short batch job which does not use
> mpirun but uses ldd.
> A proper batch job will duplicate the environment you wish to run in.
>
> ldd ./code.o
>
> By the way, is the batch system PBSPro or OpenPBS? Version 6 seems a bit old.
> Can you say what version of Redhat or CentOS this cluster is installed with?
Re: [OMPI users] Cannot run MPI code on multiple cores with PBS
Michele one tip: log into a compute node using ssh and as your own username.
If you use the Modules environment then load the modules you use in the job script,
then use the ldd utility to check if you can load all the libraries in the code.o executable.

Actually you are better to submit a short batch job which does not use mpirun but uses ldd.
A proper batch job will duplicate the environment you wish to run in.

ldd ./code.o

By the way, is the batch system PBSPro or OpenPBS? Version 6 seems a bit old.
Can you say what version of Redhat or CentOS this cluster is installed with?

On Thu, 4 Oct 2018 at 00:02, Castellana Michele wrote:
>
> I fixed it, the correct file was in /lib64, not in /lib.
>
> Thank you for your help.
>
> On Oct 3, 2018, at 11:30 PM, Castellana Michele wrote:
>
> Thank you, I found some libcrypto files in /usr/lib indeed:
>
> $ ls libcry*
> libcrypt-2.17.so libcrypto.so.10 libcrypto.so.1.0.2k libcrypt.so.1
>
> but I could not find libcrypto.so.0.9.8. Here they suggest to create a
> hyperlink, but if I do I still get an error from MPI. Is there another way
> around this?
>
> Best,
>
> On Oct 3, 2018, at 11:00 PM, Jeff Squyres (jsquyres) via users wrote:
>
> It's probably in your Linux distro somewhere -- I'd guess you're missing a
> package (e.g., an RPM or a deb) out on your compute nodes...?
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> On Oct 3, 2018, at 4:24 PM, Castellana Michele wrote:
>
> Dear Ralph,
> Thank you for your reply. Do you know where I could find libcrypto.so.0.9.8?
>
> Best,
>
> On Oct 3, 2018, at 9:41 PM, Ralph H Castain wrote:
>
> Actually, I see that you do have the tm components built, but they cannot be
> loaded because you are missing libcrypto from your LD_LIBRARY_PATH
>
> On Oct 3, 2018, at 12:33 PM, Ralph H Castain wrote:
>
> Did you configure OMPI --with-tm=? It looks like we didn't
> build PBS support and so we only see one node with a single slot allocated to
> it.
>
> On Oct 3, 2018, at 12:02 PM, Castellana Michele wrote:
>
> Dear all,
> I am having trouble running an MPI code across multiple cores on a new
> computer cluster, which uses PBS. Here is a minimal example, where I want to
> run two MPI processes, each on a different node. The PBS script is
>
> #!/bin/bash
> #PBS -l walltime=00:01:00
> #PBS -l mem=1gb
> #PBS -l nodes=2:ppn=1
> #PBS -q batch
> #PBS -N test
> mpirun -np 2 ./code.o
>
> and when I submit it with
>
> $ qsub script.sh
>
> I get the following message in the PBS error file
>
> $ cat test.e1234
> [shbli040:08879] mca_base_component_repository_open: unable to open mca_plm_tm: libcrypto.so.0.9.8: cannot open shared object file: No such file or directory (ignored)
> [shbli040:08879] mca_base_component_repository_open: unable to open mca_oob_ud: libibverbs.so.1: cannot open shared object file: No such file or directory (ignored)
> [shbli040:08879] mca_base_component_repository_open: unable to open mca_ras_tm: libcrypto.so.0.9.8: cannot open shared object file: No such file or directory (ignored)
> --
> There are not enough slots available in the system to satisfy the 2 slots
> that were requested by the application:
>   ./code.o
>
> Either request fewer slots for your application, or make more slots available
> for use.
> --
>
> The PBS version is
>
> $ qstat --version
> Version: 6.1.2
>
> and here is some additional information on the MPI version
>
> $ mpicc -v
> Using built-in specs.
> COLLECT_GCC=/bin/gcc
> COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
> Target: x86_64-redhat-linux
> [...]
> Thread model: posix
> gcc version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC)
>
> Do you guys know what may be the issue here?
>
> Thank you
> Best,
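The "unable to open" lines in the PBS error file quoted in this thread already name each plugin and the library it could not load. A small sketch that pulls those pairs out of a log, assuming each message sits on a single line in the actual file (the mail wrapped them); missing_plugin_libs is a hypothetical helper name:

```shell
#!/bin/bash
# missing_plugin_libs: read Open MPI stderr on stdin and print one
# "component library" pair per plugin that failed to load because a
# shared library could not be opened. Duplicates are removed.
missing_plugin_libs() {
    sed -n 's/.*unable to open \([^:]*\): \([^:]*\): cannot open shared object file.*/\1 \2/p' | sort -u
}
```

For example, missing_plugin_libs < test.e1234 would reduce the error file shown above to the component names paired with libcrypto.so.0.9.8 and libibverbs.so.1, which is the list of libraries to look for on the compute nodes.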