date:20181004

[OMPI users] Intermittent failure when launch application linked with OpenMPI 3.1.1

2018-10-04 Thread David Whitaker


Hi,
 When launching an application linked with OpenMPI 3.1.1 using the line:
srun --mpi=pmi2 --distribution=arbitrary \
  --cpu_bind=map_cpu:0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126 \
  -n 1024 a.out
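
The map_cpu list in the srun line above is every even CPU number from 0 to 126 (64 entries). A minimal sketch, assuming GNU seq is available, that generates the same list rather than typing it by hand; the helper name even_cpu_map and the variable CPU_MAP are hypothetical:

```shell
# Hypothetical helper: print the even CPU ids 0,2,...,MAX as a comma-separated list.
# Assumes GNU seq (BSD seq appends a trailing separator).
even_cpu_map() {
    seq -s, 0 2 "$1"
}

# Build the same binding option that appears in the srun line.
CPU_MAP=$(even_cpu_map 126)
echo "--cpu_bind=map_cpu:${CPU_MAP}"
```

The printed option could then be passed to srun in place of the hand-written list.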


 I often (most of the time) get:

[amd-0013][[29472,1],727][connect/btl_openib_connect_udcm.c:1531:udcm_find_endpoint] 
could not find endpoint with port: 1, lid: 21, msg_type: 100
[amd-0013][[29472,1],727][connect/btl_openib_connect_udcm.c:2036:udcm_process_messages] 
could not find associated endpoint.

--
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[29472,1],727]) is on host: amd-0013
  Process 2 ([[29472,1],711]) is on host: unknown!
  BTLs attempted: self openib

Your MPI job is now going to abort; sorry.
--
[amd-0013:16718] *** An error occurred in MPI_Allreduce
[amd-0013:16718] *** reported by process [1931476993,727]
[amd-0013:16718] *** on communicator MPI_COMM_WORLD
[amd-0013:16718] *** MPI_ERR_INTERN: internal error
[amd-0013:16718] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will 
now abort,



  This failure is intermittent, and sometimes the job runs without any problem.
   I have tried setting the environment variables:
export OMPI_MCA_btl_openib_connect_udcm_max_retry=500
export OMPI_MCA_btl_openib_connect_udcm_timeout=500

but it is unclear whether these are helping.
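
One thing worth ruling out is whether the exported settings actually reach the launched tasks. A minimal sketch that lists the OMPI_MCA_* variables visible in the current environment; the helper name list_mca_env is hypothetical, and this only checks the environment, not whether the openib BTL honors the values:

```shell
# Hypothetical helper: show every Open MPI MCA setting passed via the environment.
list_mca_env() {
    env | grep '^OMPI_MCA_' | sort
}

# The two settings from the message above.
export OMPI_MCA_btl_openib_connect_udcm_max_retry=500
export OMPI_MCA_btl_openib_connect_udcm_timeout=500
list_mca_env
```

Running the same filter inside a single task, for example `srun -n 1 env | grep '^OMPI_MCA_'`, would show whether the variables are propagated to the compute nodes.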

Does anyone understand what is happening and how I can prevent it?

Many thanks,
   Dave

--
CCFD
David Whitaker, Ph.D.  whita...@cray.com
Aerospace CFD Specialist   phone: (651)605-9078
ISV Applications/Cray Inc  fax: (651)605-9001
CCFD

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] Memory Leak in 3.1.2 + UCX

2018-10-04 Thread Charles A Taylor


We are seeing a gaping memory leak when running OpenMPI 3.1.x (or 2.1.2, for 
that matter) built with UCX support. The leak shows up
whether the “ucx” PML is specified for the run or not. The applications in 
question are arepo and gizmo, but I have no reason to believe
that others are not affected as well.

Basically the MPI processes grow without bound until SLURM kills the job or the 
host memory is exhausted.  
If I configure and build with “--without-ucx” the problem goes away.
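
To quantify the growth before SLURM kills the job, the resident set size of one MPI rank can be sampled over time. A minimal sketch, assuming a Linux /proc filesystem; the helper names rss_kb and watch_rss and the 10-second interval are illustrative choices, not part of the report:

```shell
# Hypothetical helper: print the resident set size (in kB) of a process,
# read from the VmRSS field of /proc/<pid>/status (Linux only).
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# Hypothetical helper: sample a PID every 10 seconds until it exits.
# Usage: watch_rss <pid>
watch_rss() {
    while [ -r "/proc/$1/status" ]; do
        printf '%s %s kB\n' "$(date +%T)" "$(rss_kb "$1")"
        sleep 10
    done
}
```

Comparing the samples from a UCX-enabled build against a build configured with --without-ucx would show how quickly the two diverge.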

I didn’t see anything about this on the UCX github site so I thought I’d ask 
here.  Anyone else seeing the same or similar?

What version of UCX is OpenMPI 3.1.x tested against?

Regards,

Charlie Taylor
UF Research Computing

Details:
---
RHEL7.5
OpenMPI 3.1.2 (and any other version I’ve tried).
ucx 1.2.2-1.el7 (RH native)
RH native IB stack
Mellanox FDR/EDR IB fabric
Intel Parallel Studio 2018.1.163

Configuration Options:
---
CFG_OPTS=""
CFG_OPTS="$CFG_OPTS C=icc CXX=icpc FC=ifort FFLAGS=\"-O2 -g -warn -m64\" 
LDFLAGS=\"\" "
CFG_OPTS="$CFG_OPTS --enable-static"
CFG_OPTS="$CFG_OPTS --enable-orterun-prefix-by-default"
CFG_OPTS="$CFG_OPTS --with-slurm=/opt/slurm"
CFG_OPTS="$CFG_OPTS --with-pmix=/opt/pmix/2.1.1"
CFG_OPTS="$CFG_OPTS --with-pmi=/opt/slurm"
CFG_OPTS="$CFG_OPTS --with-libevent=external"
CFG_OPTS="$CFG_OPTS --with-hwloc=external"
CFG_OPTS="$CFG_OPTS --with-verbs=/usr"
CFG_OPTS="$CFG_OPTS --with-libfabric=/usr"
CFG_OPTS="$CFG_OPTS --with-ucx=/usr"
CFG_OPTS="$CFG_OPTS --with-verbs-libdir=/usr/lib64"
CFG_OPTS="$CFG_OPTS --with-mxm=no"
CFG_OPTS="$CFG_OPTS --with-cuda=${HPC_CUDA_DIR}"
CFG_OPTS="$CFG_OPTS --enable-openib-udcm"
CFG_OPTS="$CFG_OPTS --enable-openib-rdmacm"
CFG_OPTS="$CFG_OPTS --disable-pmix-dstore"

rpmbuild --ba \
 --define '_name openmpi' \
 --define "_version $OMPI_VER" \
 --define "_release ${RELEASE}" \
 --define "_prefix $PREFIX" \
 --define '_mandir %{_prefix}/share/man' \
 --define '_defaultdocdir %{_prefix}' \
 --define 'mflags -j 8' \
 --define 'use_default_rpm_opt_flags 1' \
 --define 'use_check_files 0' \
 --define 'install_shell_scripts 1' \
 --define 'shell_scripts_basename mpivars' \
 --define "configure_options $CFG_OPTS " \
 openmpi-${OMPI_VER}.spec 2>&1 | tee rpmbuild.log





Re: [OMPI users] Cannot run MPI code on multiple cores with PBS

2018-10-04 Thread Jeff Squyres (jsquyres) via users

Note that what Gilles said is correct: it's not just the dependent libraries of 
libmpi.so (and friends) that matter -- it's also the dependent libraries of all 
of Open MPI's plugins that matter.

You can run "ldd *.so" in the lib directory where you installed Open MPI, but 
you'll also need to "ldd *.so" in the lib/openmpi directory -- that's where 
Open MPI installs its plugins.

I suspect that if you run "ldd lib/openmpi/mca_plm_tm.so" on the head node, 
you'll see all the dependent libraries listed.  But if you run the same command 
on your back-end compute nodes, it might say "not found" for some of the 
libraries.
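
The check described above can be sketched as a small loop over a directory of shared objects. This is only a sketch: the helper name report_missing_deps is hypothetical, and the installation prefix has to be supplied by whoever runs it:

```shell
# Hypothetical helper: print every .so file in a directory that has an
# unresolved shared-library dependency, followed by the "not found" lines.
report_missing_deps() {
    dir=$1
    for so in "$dir"/*.so; do
        [ -e "$so" ] || continue    # the glob matched nothing
        missing=$(ldd "$so" 2>/dev/null | grep 'not found' || true)
        if [ -n "$missing" ]; then
            printf '%s\n%s\n' "$so" "$missing"
        fi
    done
    return 0
}

# Usage (replace PREFIX with the Open MPI installation directory):
#   report_missing_deps "$PREFIX/lib"
#   report_missing_deps "$PREFIX/lib/openmpi"
```

Running it once on the head node and once from inside a batch job on a compute node makes any difference between the two environments visible.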




Re: [OMPI users] Cannot run MPI code on multiple cores with PBS

2018-10-04 Thread John Hearns via users

Michele, the command is   ldd ./code.o
I just Googled: ldd stands for List Dynamic Dependencies

To find out the PBS batch system type - that is a good question!
Try this: qstat --version




Re: [OMPI users] Cannot run MPI code on multiple cores with PBS

2018-10-04 Thread Castellana Michele

Dear John,
Thank you for your reply. I have tried

ldd mpirun ./code.o

but I get an error message; I do not know the proper syntax for the ldd 
command. Here is the information about the Linux version:

$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

Could you please tell me how to check whether the batch system is PBSPro or 
OpenPBS? 

Best,
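
ldd takes paths to files rather than command names, so "ldd mpirun" looks for a file called mpirun in the current directory. A minimal sketch of one way around that, with the launcher resolved through PATH first; the helper name exe_path is hypothetical:

```shell
# Hypothetical helper: resolve a command name to its full path via PATH lookup.
exe_path() {
    command -v "$1"
}

# Inspect the launcher and the application separately; skip anything missing.
for target in "$(exe_path mpirun)" ./code.o; do
    if [ -n "$target" ] && [ -e "$target" ]; then
        ldd "$target" || echo "ldd failed for $target" >&2
    fi
done
```

Any line ending in "not found" in the output names a library that the dynamic loader cannot locate in that environment.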




Re: [OMPI users] Cannot run MPI code on multiple cores with PBS

2018-10-04 Thread Gilles Gouaillardet

In this case, some Open MPI plugins are missing some third-party libraries,
so you would have to run ldd on all the plugins (i.e., the .so files) located
in /lib/openmpi
in order to reveal any issue.

Cheers,

Gilles


Re: [OMPI users] Cannot run MPI code on multiple cores with PBS

2018-10-04 Thread John Hearns via users

Michele, one tip: log into a compute node using ssh, as your own username.
If you use the Modules environment, load the modules you use in
the job script,
then use the ldd utility to check whether all the libraries
needed by the code.o executable can be loaded.

Actually, you would be better off submitting a short batch job which does not use
mpirun but runs ldd instead.
A proper batch job will duplicate the environment you wish to run in.

ldd ./code.o

By the way, is the batch system PBSPro or OpenPBS?  Version 6 seems a bit old.
Can you say what version of Redhat or CentOS this cluster is installed with?
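
The batch-job suggestion above can be sketched as follows. The resource requests mirror the ones in the poster's own PBS script; the job name lddtest and the helper check_exe are hypothetical, and the script body runs only on the first allocated node:

```shell
#!/bin/bash
#PBS -l walltime=00:01:00
#PBS -l nodes=2:ppn=1
#PBS -q batch
#PBS -N lddtest

# Hypothetical helper: list the shared libraries of an executable and fail
# if the file is missing or any dependency is reported as "not found".
check_exe() {
    [ -e "$1" ] || { echo "no such file: $1" >&2; return 1; }
    out=$(ldd "$1" 2>&1) || true
    printf '%s\n' "$out"
    ! printf '%s\n' "$out" | grep -q 'not found'
}

# PBS starts jobs in the home directory, so change to the submission directory first.
cd "${PBS_O_WORKDIR:-.}" || exit 1
check_exe ./code.o || echo "dependency check failed on $(hostname)" >&2
```

The ldd output lands in the job's output file, so it reflects the environment the real job would see rather than an interactive login.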



On Thu, 4 Oct 2018 at 00:02, Castellana Michele
 wrote:
>
> I fixed it, the correct file was in /lib64, not in /lib.
>
> Thank you for your help.
>
> On Oct 3, 2018, at 11:30 PM, Castellana Michele  
> wrote:
>
> Thank you, I found some libcrypto files in /usr/lib indeed:
>
> $ ls libcry*
> libcrypt-2.17.so  libcrypto.so.10  libcrypto.so.1.0.2k  libcrypt.so.1
>
> but I could not find libcrypto.so.0.9.8. Here they suggest to create a 
> hyperlink, but if I do I still get an error from MPI. Is there another way 
> around this?
>
> Best,
>
> On Oct 3, 2018, at 11:00 PM, Jeff Squyres (jsquyres) via users 
>  wrote:
>
> It's probably in your Linux distro somewhere -- I'd guess you're missing a 
> package (e.g., an RPM or a deb) out on your compute nodes...?
>
>
> On Oct 3, 2018, at 4:24 PM, Castellana Michele  
> wrote:
>
> Dear Ralph,
> Thank you for your reply. Do you know where I could find libcrypto.so.0.9.8 ?
>
> Best,
>
> On Oct 3, 2018, at 9:41 PM, Ralph H Castain  wrote:
>
> Actually, I see that you do have the tm components built, but they cannot be 
> loaded because you are missing libcrypto from your LD_LIBRARY_PATH
>
>
> On Oct 3, 2018, at 12:33 PM, Ralph H Castain  wrote:
>
> Did you configure OMPI --with-tm=? It looks like we didn’t 
> build PBS support and so we only see one node with a single slot allocated to 
> it.
>
>
> On Oct 3, 2018, at 12:02 PM, Castellana Michele  
> wrote:
>
> Dear all,
> I am having trouble running an MPI code across multiple cores on a new 
> computer cluster, which uses PBS. Here is a minimal example, where I want to 
> run two MPI processes, each on a different node. The PBS script is
>
> #!/bin/bash
> #PBS -l walltime=00:01:00
> #PBS -l mem=1gb
> #PBS -l nodes=2:ppn=1
> #PBS -q batch
> #PBS -N test
> mpirun -np 2 ./code.o
>
> and when I submit it with
>
> $ qsub script.sh
>
> I get the following message in the PBS error file
>
> $ cat test.e1234
> [shbli040:08879] mca_base_component_repository_open: unable to open 
> mca_plm_tm: libcrypto.so.0.9.8: cannot open shared object file: No such file 
> or directory (ignored)
> [shbli040:08879] mca_base_component_repository_open: unable to open 
> mca_oob_ud: libibverbs.so.1: cannot open shared object file: No such file or 
> directory (ignored)
> [shbli040:08879] mca_base_component_repository_open: unable to open 
> mca_ras_tm: libcrypto.so.0.9.8: cannot open shared object file: No such file 
> or directory (ignored)
> --
> There are not enough slots available in the system to satisfy the 2 slots
> that were requested by the application:
>  ./code.o
>
> Either request fewer slots for your application, or make more slots available
> for use.
> --
>
> The PBS version is
>
> $ qstat --version
> Version: 6.1.2
>
> and here is some additional information on the MPI version
>
> $ mpicc -v
> Using built-in specs.
> COLLECT_GCC=/bin/gcc
> COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
> Target: x86_64-redhat-linux
> […]
> Thread model: posix
> gcc version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC)
>
> Do you guys know what may be the issue here?
>
> Thank you
> Best,
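
When the tm components fail to load, mpirun cannot read the allocation from PBS and sees one node with a single slot, which matches the "not enough slots" message above. A minimal sketch for checking what PBS actually allocated, assuming the scheduler sets PBS_NODEFILE inside the job; the helper name slots_per_node is hypothetical:

```shell
# Hypothetical helper: count how many slots each host contributes,
# given a node file with one host name per line (one line per slot).
slots_per_node() {
    awk '{ n[$1]++ } END { for (h in n) print h, n[h] }' "$1" | sort
}

# Inside a PBS job, PBS_NODEFILE points at the list of allocated hosts.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    slots_per_node "$PBS_NODEFILE"
fi
```

If the file lists two hosts while mpirun still reports one slot, the allocation itself is fine and the problem is on the Open MPI side. As a stopgap, the same file could be handed to mpirun with --hostfile "$PBS_NODEFILE", though that assumes ssh access between the nodes.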
