Re: [OMPI users] [EXTERNAL] Re: Newbie With Issues

2021-03-30 Thread Prentice Bisbal via users

This should handle modifying his LD_LIBRARY_PATH correctly


  but doing an Intel setvars.sh


How are you doing 'intel setvars.sh'? I believe you need to source that 
rather than execute it. Also, there might be other files you need to 
source. I have access to 2019.u3, and in the install_root, I see these 3 
files:


./bin/ifortvars.sh
./bin/iccvars.sh
./bin/compilervars.sh

I don't see a setvars.sh, so things have probably changed in the latest 
version from 2019.u3, but I would check the documentation and make sure 
you're sourcing/executing everything you need to. Some of those scripts 
require arguments, too, if I remember right. We set up environment 
modules to make these changes, and someone else does that now, so I 
don't remember the details.
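
For reference, a minimal sketch of what I mean by sourcing (the paths below are examples only, not your actual install locations):

    # oneAPI layout (newer releases ship a single top-level script):
    source /opt/intel/oneapi/setvars.sh

    # 2019-era layout (per-compiler scripts that take an architecture argument):
    source <install_root>/bin/compilervars.sh intel64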



Prentice

On 3/30/21 1:25 PM, Pritchard Jr., Howard via users wrote:

Hi Ben,

You're heading down the right path

On our HPC systems, we use modules to handle things like setting 
LD_LIBRARY_PATH etc. when using Intel 21.x.y and other Intel compilers.
For example, for the Intel/21.1.1 the following were added to LD_LIBRARY_PATH 
(edited to avoid posting explicit paths on our systems)

prepend-path LD_LIBRARY_PATH /path_to_compiler_install/x86_64/oneapi/2021.1.0.2684/compiler/2021.1.1/linux/lib:/path_to_compiler_install/x86_64/oneapi/2021.1.0.2684/compiler/2021.1.1/linux/compiler/lib/intel64_lin
prepend-path PATH /path_to_compiler_install/x86_64/oneapi/2021.1.0.2684/compiler/2021.1.1/linux/bin
prepend-path LD_LIBRARY_PATH /path_to_compiler_install/x86_64/oneapi/2021.1.0.2684/compiler/2021.1.1/linux/lib/emu
prepend-path LD_LIBRARY_PATH /path_to_compiler_install/x86_64/oneapi/2021.1.0.2684/compiler/2021.1.1/linux/lib/x64
prepend-path LD_LIBRARY_PATH /path_to_compiler_install/x86_64/oneapi/2021.1.0.2684/compiler/2021.1.1/linux/lib

You should check which intel compiler libraries you installed and make sure 
you're prepending the relevant folders to LD_LIBRARY_PATH.
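
If you aren't using environment modules, the plain-shell equivalent of the above would look roughly like this (the install root is a placeholder, and the exact subdirectories depend on which components you installed):

    ONEAPI=/path_to_compiler_install/x86_64/oneapi/2021.1.0.2684/compiler/2021.1.1/linux
    export PATH=$ONEAPI/bin:$PATH
    export LD_LIBRARY_PATH=$ONEAPI/lib:$ONEAPI/lib/x64:$ONEAPI/compiler/lib/intel64_lin:$LD_LIBRARY_PATH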

We have tested building Open MPI with the Intel OneAPI compilers and except for 
ifx, things went okay.

Howard

On 3/30/21, 11:12 AM, "users on behalf of bend linux4ms.net via users" 
 wrote:

 I think I have found one of the issues. I took the check c program from 
openmpi
 and tried to compile and got the following:

 [root@jean-r8-sch24 benchmarks]# icc dummy.c
 ld: cannot find -lstdc++
 [root@jean-r8-sch24 benchmarks]# cat dummy.c
 int
 main ()
  {

   ;
   return 0;
 }
 [root@jean-r8-sch24 benchmarks]#

 Ben Duncan - Business Network Solutions, Inc. 336 Elton Road Jackson MS, 
39212
 "Never attribute to malice, that which can be adequately explained by 
stupidity"
 - Hanlon's Razor




 
 From: users  on behalf of bend linux4ms.net 
via users 
 Sent: Tuesday, March 30, 2021 12:00 PM
 To: Open MPI Users
 Cc: bend linux4ms.net
 Subject: Re: [OMPI users] Newbie With Issues

 Thanks Mr Heinz for responding.

 It may be the case with clang, but doing an Intel setvars.sh then issuing 
the following compile gives me this message:

 [root@jean-r8-sch24 openmpi-4.1.0]# icc
 icc: command line error: no files specified; for help type "icc -help"
 [root@jean-r8-sch24 openmpi-4.1.0]# icc -v
 icc version 2021.1 (gcc version 8.3.1 compatibility)
 [root@jean-r8-sch24 openmpi-4.1.0]#

 That would lead me to believe that icc is still available to use.

 This is a government contract and they want the latest and greatest.

 Ben Duncan - Business Network Solutions, Inc. 336 Elton Road Jackson MS, 
39212
 "Never attribute to malice, that which can be adequately explained by 
stupidity"
 - Hanlon's Razor




 
 From: Heinz, Michael  William 
 Sent: Tuesday, March 30, 2021 11:52 AM
 To: Open MPI Users
 Cc: bend linux4ms.net
 Subject: RE: Newbie With Issues

 It looks like you're trying to build Open MPI with the Intel C compiler. 
TBH - I think that icc isn't included with the latest release of oneAPI, I 
think they've switched to including clang instead. I had a similar issue to 
yours but I resolved it by installing a 2020 version of the Intel HPC software. 
Unfortunately, those versions require purchasing a license.

 -Original Message-
 From: users  On Behalf Of bend 
linux4ms.net via users
 Sent: Tuesday, March 30, 2021 12:42 PM
 To: Open MPI Open MPI 
 Cc: bend linux4ms.net 
 Subject: [OMPI users] Newbie With Issues

 Hello group, my name is Ben Duncan. I have been tasked with installing 
openMPI and the Intel compiler on an HPC system. I am new to the whole HPC and 
MPI environment, so be patient with me.

 I have successfully gotten the Intel compiler (oneAPI version, from 
l_HPCKit_p_2021.1.0.2684_offline.sh) installed without any errors.

 I am trying to install and configure the openMPI version 4.1.0 ...

Re: [OMPI users] [External] Re: Newbie With Issues

2021-03-30 Thread Prentice Bisbal via users
That error message is right in your original post, and I didn't even see 
it:



configure:6541: icc -O2   conftest.c  >&5
ld: cannot find -lstdc++

Well, that's an easy fix.
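
For the archives: 'ld: cannot find -lstdc++' usually just means the C++ standard library development package isn't installed on the build host. A sketch of the fix, assuming a RHEL/CentOS 8-family system (guessing from the "r8" hostname):

    dnf install libstdc++-devel gcc-c++
    icc dummy.c && ./a.out    # should now link and run cleanly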

I guess my eyes stopped when I got to this:


configure: error: C compiler cannot create executables See `config.log' for 
more details

You can ignore my message about /tmp being mounted noexec. ;)

Prentice

On 3/30/21 1:04 PM, bend linux4ms.net via users wrote:

I think I have found one of the issues. I took the check c program from openmpi
and tried to compile and got the following:

[root@jean-r8-sch24 benchmarks]# icc dummy.c
ld: cannot find -lstdc++
[root@jean-r8-sch24 benchmarks]# cat dummy.c
int
main ()
  {

   ;
   return 0;
}
[root@jean-r8-sch24 benchmarks]#

Ben Duncan - Business Network Solutions, Inc. 336 Elton Road Jackson MS, 39212
"Never attribute to malice, that which can be adequately explained by stupidity"
- Hanlon's Razor





From: users  on behalf of bend linux4ms.net via 
users 
Sent: Tuesday, March 30, 2021 12:00 PM
To: Open MPI Users
Cc: bend linux4ms.net
Subject: Re: [OMPI users] Newbie With Issues

Thanks Mr Heinz for responding.

It may be the case with clang, but doing an Intel setvars.sh then issuing the 
following compile gives me this message:

[root@jean-r8-sch24 openmpi-4.1.0]# icc
icc: command line error: no files specified; for help type "icc -help"
[root@jean-r8-sch24 openmpi-4.1.0]# icc -v
icc version 2021.1 (gcc version 8.3.1 compatibility)
[root@jean-r8-sch24 openmpi-4.1.0]#

That would lead me to believe that icc is still available to use.

This is a government contract and they want the latest and greatest.

Ben Duncan - Business Network Solutions, Inc. 336 Elton Road Jackson MS, 39212
"Never attribute to malice, that which can be adequately explained by stupidity"
- Hanlon's Razor





From: Heinz, Michael  William 
Sent: Tuesday, March 30, 2021 11:52 AM
To: Open MPI Users
Cc: bend linux4ms.net
Subject: RE: Newbie With Issues

It looks like you're trying to build Open MPI with the Intel C compiler. TBH - 
I think that icc isn't included with the latest release of oneAPI, I think 
they've switched to including clang instead. I had a similar issue to yours but 
I resolved it by installing a 2020 version of the Intel HPC software. 
Unfortunately, those versions require purchasing a license.

-Original Message-
From: users  On Behalf Of bend linux4ms.net 
via users
Sent: Tuesday, March 30, 2021 12:42 PM
To: Open MPI Open MPI 
Cc: bend linux4ms.net 
Subject: [OMPI users] Newbie With Issues

Hello group, my name is Ben Duncan. I have been tasked with installing openMPI 
and the Intel compiler on an HPC system. I am new to the whole HPC and MPI 
environment, so be patient with me.

I have successfully gotten the Intel compiler (oneAPI version, from 
l_HPCKit_p_2021.1.0.2684_offline.sh) installed without any errors.

I am trying to install and configure openMPI version 4.1.0, however trying 
to run the configuration for openmpi gives me the following error:


== Configuring Open MPI


*** Startup tests
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking target system type... x86_64-unknown-linux-gnu
checking for gcc... icc
checking whether the C compiler works... no
configure: error: in `/p/app/openmpi-4.1.0':
configure: error: C compiler cannot create executables
See `config.log' for more details

With the error in config.log being:

configure:6499: $? = 0
configure:6488: icc -qversion >&5
icc: command line warning #10006: ignoring unknown option '-qversion'
icc: command line error: no files specified; for help type "icc -help"
configure:6499: $? = 1
configure:6519: checking whether the C compiler works
configure:6541: icc -O2   conftest.c  >&5
ld: cannot find -lstdc++
configure:6545: $? = 1
configure:6583: result: no
configure: failed program was:
| /* confdefs.h */
| #define PACKAGE_NAME "Open MPI"
| #define PACKAGE_TARNAME "openmpi"
| #define PACKAGE_VERSION "4.1.0"
| #define PACKAGE_STRING "Open MPI 4.1.0"
| #define PACKAGE_BUGREPORT "http://www.open-mpi.org/community/help/"
| #define PACKAGE_URL ""
| #define OPAL_ARCH "x86_64-unknown-linux-gnu"
| /* end confdefs.h.  */
|
| int
| main ()
| {
|
|   ;
|   return 0;
| }
configure:6588: error: in `/p/app/openmpi-4.1.0':
configure:6590: error: C compiler cannot create executables See `config.log' for more details

Re: [OMPI users] [External] Re: Newbie With Issues

2021-03-30 Thread Prentice Bisbal via users

No, icc is there:


configure:6488: icc -qversion >&5
icc: command line warning #10006: ignoring unknown option '-qversion'
icc: command line error: no files specified; for help type "icc -help"


Those error messages are coming directly from icc.

Prentice

On 3/30/21 12:52 PM, Heinz, Michael William via users wrote:

It looks like you're trying to build Open MPI with the Intel C compiler. TBH - 
I think that icc isn't included with the latest release of oneAPI, I think 
they've switched to including clang instead. I had a similar issue to yours but 
I resolved it by installing a 2020 version of the Intel HPC software. 
Unfortunately, those versions require purchasing a license.

-Original Message-
From: users  On Behalf Of bend linux4ms.net 
via users
Sent: Tuesday, March 30, 2021 12:42 PM
To: Open MPI Open MPI 
Cc: bend linux4ms.net 
Subject: [OMPI users] Newbie With Issues

Hello group, my name is Ben Duncan. I have been tasked with installing openMPI 
and the Intel compiler on an HPC system. I am new to the whole HPC and MPI 
environment, so be patient with me.

I have successfully gotten the Intel compiler (oneAPI version, from 
l_HPCKit_p_2021.1.0.2684_offline.sh) installed without any errors.

I am trying to install and configure openMPI version 4.1.0, however trying 
to run the configuration for openmpi gives me the following error:


== Configuring Open MPI


*** Startup tests
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking target system type... x86_64-unknown-linux-gnu
checking for gcc... icc
checking whether the C compiler works... no
configure: error: in `/p/app/openmpi-4.1.0':
configure: error: C compiler cannot create executables
See `config.log' for more details

With the error in config.log being:

configure:6499: $? = 0
configure:6488: icc -qversion >&5
icc: command line warning #10006: ignoring unknown option '-qversion'
icc: command line error: no files specified; for help type "icc -help"
configure:6499: $? = 1
configure:6519: checking whether the C compiler works
configure:6541: icc -O2   conftest.c  >&5
ld: cannot find -lstdc++
configure:6545: $? = 1
configure:6583: result: no
configure: failed program was:
| /* confdefs.h */
| #define PACKAGE_NAME "Open MPI"
| #define PACKAGE_TARNAME "openmpi"
| #define PACKAGE_VERSION "4.1.0"
| #define PACKAGE_STRING "Open MPI 4.1.0"
| #define PACKAGE_BUGREPORT "http://www.open-mpi.org/community/help/"
| #define PACKAGE_URL ""
| #define OPAL_ARCH "x86_64-unknown-linux-gnu"
| /* end confdefs.h.  */
|
| int
| main ()
| {
|
|   ;
|   return 0;
| }
configure:6588: error: in `/p/app/openmpi-4.1.0':
configure:6590: error: C compiler cannot create executables See `config.log' 
for more details



My configure line looks like:

./configure --prefix=/p/app/compilers/openmpi-4.1.0/openmpi-4.1.0.intel  
--enable-wrapper-rpath   --disable-libompitrace  
--enable-mpirun-prefix-by-default --enable-mpi-fortran

So what am I doing wrong, or is it something else?

Thanks


Ben Duncan - Business Network Solutions, Inc. 336 Elton Road Jackson MS, 39212 
"Never attribute to malice, that which can be adequately explained by stupidity"
- Hanlon's Razor




Re: [OMPI users] [External] Newbie With Issues

2021-03-30 Thread Prentice Bisbal via users
Is this your own Linux system, or a work/school system? Some security 
guidelines, like the CIS Security benchmarks, recommend making /tmp 
its own filesystem and mounting it with the 'noexec' option. That can cause 
this error. The configure script works by seeing if it can compile 
and/or run small code snippets. It does this in /tmp, so if /tmp is 
mounted 'noexec', you end up with this error.


You can see if /tmp is a separate filesystem with noexec set by looking 
at the output of the 'mount' command.
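
For example (a sketch; the exact output depends on your fstab):

    mount | grep ' /tmp '    # look for 'noexec' among the mount options
    # If it is noexec, one common (assumed) workaround is to point scratch work elsewhere:
    mkdir -p $HOME/tmp && export TMPDIR=$HOME/tmp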



Prentice

On 3/30/21 12:41 PM, bend linux4ms.net via users wrote:

Hello group, my name is Ben Duncan. I have been tasked with installing openMPI 
and the Intel compiler on an HPC system. I am new to the whole HPC and MPI 
environment, so be patient with me.

I have successfully gotten the Intel compiler (oneAPI version, from 
l_HPCKit_p_2021.1.0.2684_offline.sh) installed without any errors.

I am trying to install and configure openMPI version 4.1.0,
however trying to run the configuration for openmpi gives me the following error:


== Configuring Open MPI


*** Startup tests
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking target system type... x86_64-unknown-linux-gnu
checking for gcc... icc
checking whether the C compiler works... no
configure: error: in `/p/app/openmpi-4.1.0':
configure: error: C compiler cannot create executables
See `config.log' for more details

With the error in config.log being:

configure:6499: $? = 0
configure:6488: icc -qversion >&5
icc: command line warning #10006: ignoring unknown option '-qversion'
icc: command line error: no files specified; for help type "icc -help"
configure:6499: $? = 1
configure:6519: checking whether the C compiler works
configure:6541: icc -O2   conftest.c  >&5
ld: cannot find -lstdc++
configure:6545: $? = 1
configure:6583: result: no
configure: failed program was:
| /* confdefs.h */
| #define PACKAGE_NAME "Open MPI"
| #define PACKAGE_TARNAME "openmpi"
| #define PACKAGE_VERSION "4.1.0"
| #define PACKAGE_STRING "Open MPI 4.1.0"
| #define PACKAGE_BUGREPORT "http://www.open-mpi.org/community/help/"
| #define PACKAGE_URL ""
| #define OPAL_ARCH "x86_64-unknown-linux-gnu"
| /* end confdefs.h.  */
|
| int
| main ()
| {
|
|   ;
|   return 0;
| }
configure:6588: error: in `/p/app/openmpi-4.1.0':
configure:6590: error: C compiler cannot create executables
See `config.log' for more details



My configure line looks like:

./configure --prefix=/p/app/compilers/openmpi-4.1.0/openmpi-4.1.0.intel  
--enable-wrapper-rpath   --disable-libompitrace  
--enable-mpirun-prefix-by-default --enable-mpi-fortran

So what am I doing wrong, or is it something else?

Thanks


Ben Duncan - Business Network Solutions, Inc. 336 Elton Road Jackson MS, 39212
"Never attribute to malice, that which can be adequately explained by stupidity"
- Hanlon's Razor




Re: [OMPI users] [External] Help with MPI and macOS Firewall

2021-03-18 Thread Prentice Bisbal via users
OpenMPI should only be using shared memory on the local host 
automatically, but maybe you need to force it.


I think

mpirun -mca btl self,vader ...

should do that.

or you can exclude tcp instead

mpirun -mca btl ^tcp

See

https://www.open-mpi.org/faq/?category=sm

for more info.
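
If that solves it, you can make the setting permanent instead of typing it every time by putting it in the standard per-user MCA parameter file (a sketch):

    # ~/.openmpi/mca-params.conf
    btl = self,vader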

Prentice

On 3/18/21 12:28 PM, Matt Thompson via users wrote:

All,

This isn't specifically an Open MPI issue, but as that is the MPI 
stack I use on my laptop, I'm hoping someone here might have a 
possible solution. (I am pretty sure something like MPICH would 
trigger this as well.)


Namely, my employer recently did something somewhere so that now *any* 
MPI application I run will throw popups like this one:


https://user-images.githubusercontent.com/4114656/30962814-866f3010-a44b-11e7-9de3-9f2a3b0229c0.png 



though for me it's asking about "orterun" and "helloworld.mpi3.exe", 
etc. I essentially get one-per-process.


If I had sudo access, I suppose I could just keep clicking "Allow" for 
every program, but I don't and I compile lots of programs with 
different names.


So, I was hoping maybe an Open MPI guru out there knew of an MCA thing 
I could use to avoid them? This is all isolated on-my-laptop MPI I'm 
doing, so at most an "mpirun --oversubscribe -np 12" or something. 
It'll never go over my network to anything, etc.


--
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better 
Anna Rampton


Re: [OMPI users] [External] Re: Error intialising an OpenFabrics device.

2021-03-18 Thread Prentice Bisbal via users

  If you disable it with -mtl ^openib the warning goes away.

And the performance of openib goes away right along with it.

Prentice

On 3/13/21 5:43 PM, Heinz, Michael William via users wrote:

I’ve begun getting this annoyingly generic warning, too. It appears to be 
coming from the openib provider. If you disable it with -mtl ^openib the 
warning goes away.

Sent from my iPad


On Mar 13, 2021, at 3:28 PM, Bob Beattie via users  
wrote:

Hi everyone,

To be honest, as an MPI / IB noob, I don't know if this falls under OpenMPI or 
Mellanox

I'm running a small cluster of HP DL380 G6/G7 machines.
Each runs Ubuntu server 20.04 and has a Mellanox ConnectX-3 card, connected by 
an IS dumb switch.
When I begin my MPI program (snappyHexMesh for OpenFOAM) I get an error 
reported.
The error doesn't stop my programs or appear to cause any problems, so this 
request for help is more about delving into the why.

OMPI is compiled from source using v4.0.3; which is the default version for 
Ubuntu 20.04
This compiles and works.  I did this because I wanted to understand the 
compilation process whilst using a known working OMPI version.

The Infiniband part is the Mellanox MLNXOFED installer v4.9-0.1.7.0 and I 
install that with --dkms --without-fw-update --hpc --with-nfsrdma

The actual error reported is:
Warning: There was an error initialising an OpenFabrics device.
   Local host: of1
   Local device: mlx4_0

Then shortly after:
[of1:1015399] 19 more processes have sent help message help-mpi-btl-openib.txt 
/ error in device init
[of1:1015399] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help 
/ error messages

Adding this MCA parameter to the mpirun line simply gives me 20 or so copies of 
the first warning.

Any ideas anyone ?
Cheers,
Bob.


Re: [OMPI users] [External] Re: mpi/pmix: ERROR: Error handler invoked: status = -25: No such file or directory (2)

2020-11-12 Thread Prentice Bisbal via users

That's what I suspected. Thanks for confirming.

Prentice

On 11/12/20 1:46 PM, Ralph Castain via users wrote:

Yeah - this can be safely ignored. Basically, what's happening is an async 
cleanup of a tmp directory and the code is barking that it wasn't found 
(because it was already deleted).



On Nov 12, 2020, at 8:16 AM, Prentice Bisbal via users 
 wrote:

I should give more background. In the slurm error log for this job, there was 
another error about a memcpy operation failing listed first, so that caused the 
job to fail. I suspect these errors below are the result of the other MPI ranks 
being killed in a not exactly simultaneous manner, which is to be expected. I 
just want to make sure that this was the case, and the error below wasn't a 
sign of another issue with the job.

Prentice

On 11/11/20 5:47 PM, Ralph Castain via users wrote:

Looks like it is coming from the Slurm PMIx plugin, not OMPI.

Artem - any ideas?
Ralph



On Nov 11, 2020, at 10:03 AM, Prentice Bisbal via users 
 wrote:

One of my users recently reported a failed job that was using OpenMPI 4.0.4 
compiled with PGI 20.4. There were two different errors reported. One was reported 
once, and I think had nothing to do with OpenMPI or PMIX, and then this error 
was repeated multiple times in the Slurm error output for the job:

pmixp_client_v2.c:210 [_errhandler] mpi/pmix: ERROR: Error handler invoked: 
status = -25: No such file or directory (2)

Anyone else see this before? Any idea what would cause this error? I did a 
google search but couldn't find any discussion of this error anywhere.

--
Prentice


--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov




--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov



Re: [OMPI users] [External] Re: mpi/pmix: ERROR: Error handler invoked: status = -25: No such file or directory (2)

2020-11-12 Thread Prentice Bisbal via users
I should give more background. In the slurm error log for this job, 
there was another error about a memcpy operation failing listed first, 
so that caused the job to fail. I suspect these errors below are the 
result of the other MPI ranks being killed in a not exactly simultaneous 
manner, which is to be expected. I just want to make sure that this was 
the case, and the error below wasn't a sign of another issue with the job.


Prentice

On 11/11/20 5:47 PM, Ralph Castain via users wrote:

Looks like it is coming from the Slurm PMIx plugin, not OMPI.

Artem - any ideas?
Ralph



On Nov 11, 2020, at 10:03 AM, Prentice Bisbal via users 
 wrote:

One of my users recently reported a failed job that was using OpenMPI 4.0.4 
compiled with PGI 20.4. There were two different errors reported. One was reported 
once, and I think had nothing to do with OpenMPI or PMIX, and then this error 
was repeated multiple times in the Slurm error output for the job:

pmixp_client_v2.c:210 [_errhandler] mpi/pmix: ERROR: Error handler invoked: 
status = -25: No such file or directory (2)

Anyone else see this before? Any idea what would cause this error? I did a 
google search but couldn't find any discussion of this error anywhere.

--
Prentice




--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov



[OMPI users] mpi/pmix: ERROR: Error handler invoked: status = -25: No such file or directory (2)

2020-11-11 Thread Prentice Bisbal via users
One of my users recently reported a failed job that was using OpenMPI 
4.0.4 compiled with PGI 20.4. There were two different errors reported. One 
was reported once, and I think had nothing to do with OpenMPI or PMIX, 
and then this error was repeated multiple times in the Slurm error 
output for the job:


pmixp_client_v2.c:210 [_errhandler] mpi/pmix: ERROR: Error handler 
invoked: status = -25: No such file or directory (2)


Anyone else see this before? Any idea what would cause this error? I did 
a google search but couldn't find any discussion of this error anywhere.


--
Prentice



Re: [OMPI users] [External] Re: mpirun on Kubuntu 20.4.1 hangs

2020-10-22 Thread Prentice Bisbal via users
 Could SELinux or AppArmor be active by default for a new install and 
be causing this problem?
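
Quick ways to check (a sketch; the second command assumes an Ubuntu-style AppArmor setup):

    getenforce          # SELinux: Enforcing / Permissive / Disabled
    sudo aa-status      # AppArmor: lists loaded/enforcing profiles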


Prentice

On 10/21/20 12:22 PM, Jorge SILVA via users wrote:


Hello Gus,

 Thank you for your answer. Unfortunately my problem is much more 
basic. I didn't try to run the program on both computers, but just 
to run something on one computer. I just installed the new OS and 
openmpi on two different computers, in the standard way, with the same 
result.


For example:

In kubuntu20.4.1 LTS with openmpi 4.0.3-0ubuntu

jorge@gcp26:~/MPIRUN$ cat hello.f90
 print*,"Hello World!"
end
jorge@gcp26:~/MPIRUN$ mpif90 hello.f90 -o hello
jorge@gcp26:~/MPIRUN$ ./hello
 Hello World!
jorge@gcp26:~/MPIRUN$ mpirun -np 1 hello <---here the program hangs 
with no output

^C^Cjorge@gcp26:~/MPIRUN$

The mpirun task sleeps with no output, and only typing Ctrl-C twice ends the 
execution:


jorge   5540  0.1  0.0  44768  8472 pts/8    S+ 17:54   0:00 
mpirun -np 1 hello


In kubuntu 18.04.5 LTS with openmpi 2.1.1, of course, the same program 
gives


jorge@gcp30:~/MPIRUN$ cat hello.f90
 print*, "Hello World!"
 END
jorge@gcp30:~/MPIRUN$ mpif90 hello.f90 -o hello
jorge@gcp30:~/MPIRUN$ ./hello
 Hello World!
jorge@gcp30:~/MPIRUN$ mpirun -np 1 hello
 Hello World
jorge@gcp30:~/MPIRUN$


Even just typing mpirun hangs without the usual error message.

Are there any changes between the two versions of openmpi that I 
missed? Is mpirun missing some package?


Thank you again for your help

Jorge


Le 21/10/2020 à 00:20, Gus Correa a écrit :

Hi Jorge

You may have an active firewall protecting either computer or both,
and preventing mpirun to start the connection.
Your /etc/hosts file may also not have the computer IP addresses.
You may also want to try the --hostfile option.
Likewise, the --verbose option may also help diagnose the problem.

It would help if you send the mpirun command line, the hostfile (if any),
error message if any, etc.


These FAQs may help diagnose and solve the problem:

https://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems
https://www.open-mpi.org/faq/?category=running#mpirun-hostfile
https://www.open-mpi.org/faq/?category=running

I hope this helps,
Gus Correa
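
To make the verbosity suggestion concrete, a sketch using Open MPI's MCA verbosity parameters (the component names and levels shown are just examples):

    mpirun --mca plm_base_verbose 10 --mca odls_base_verbose 10 -np 1 ./hello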

On Tue, Oct 20, 2020 at 4:47 PM Jorge SILVA via users wrote:


Hello,

I installed kubuntu20.4.1 with openmpi 4.0.3-0ubuntu in two
different
computers in the standard way. Compiling with mpif90 works, but
mpirun
hangs with no output in both systems. Even mpirun command without
parameters hangs and only twice ctrl-C typing can end the sleeping
program. Only  the command

 mpirun --help

gives the usual output.

Seems that is something related to the terminal output, but the
command
worked well for Kubuntu 18.04. Is there a way to debug or fix this
problem (without re-compiling from sources, etc)? Is it a known
problem?

Thanks,

  Jorge



Re: [OMPI users] [External] Re: MPI is still dominant paradigm?

2020-08-07 Thread Prentice Bisbal via users
If you want to continue this conversation in a more appropriate forum, 
may I recommend the Beowulf mailing list? Discussing *anything* 
HPC-related is fair game there. It's a low-volume list, but the 
conversation can get quite lively sometimes.


https://www.beowulf.org/mailman/listinfo/beowulf

On 8/7/20 11:37 AM, Gilles Gouaillardet via users wrote:


 The goal of Open MPI is to provide a high quality implementation of the MPI standard,

and the goal of this mailing list is to discuss Open MPI (and not the 
MPI standard)


The Java bindings support "recent" JDK, and if you face an issue, 
please report a bug (either here or on github)


Cheers,

Gilles

- Original Message -

Hello,
This may be a bit of a longer post and I am not sure if it is even
appropriate here but I figured I ask. There are no hidden agendas
in it, so please treat it as "asking for opinions/advice", as
opposed to judging or provoking.
For the period between 2010 to 2017 I used to work in (buzzword
alert!) "big data" (meaning Spark, HDFS, reactive stuff like Akka)
but way before that in the early 2000s I used to write basic
multithreaded C and some MPI code. I came back to HPC/academia two
years ago and what struck me was that (for lack of better word)
the field is still "stuck" (again, for lack of better word) on
MPI. This itself may seem negative in this context, however, I am
just stating my observation, which may be wrong.
I like low level programming and I like being in control of what
is going on but having had the experience in Spark and Akka, I
kind of got spoiled. Yes, I understand that the latter has
fault-tolerance (which is nice) and MPI doesn't (or at least,
didn't when I played with in 1999-2005) but I always felt like MPI
needed higher level abstractions as a CHOICE (not _only_ choice)
laid over the bare metal offerings. The whole world has moved onto
programming in patterns and higher level abstractions, why is the
academic/HPC world stuck on bare metal, still? Yes, I understand
that performance often matters and the higher up you go, the more
performance loss you incur, however, there is also something to be
said about developer time and ease of understanding/abstracting
etc. etc.
Be that as it may, I am working on a project now in the HPC world
and I noticed that Open MPI has Java bindings (or should I say
"interface"?). What is the state of those? Which JDK do they
support? Most importantly, would it be a HUGE pipe dream to think
about building patterns a-la Akka (or even mixing actual Akka
implementation) on top of OpenMPI via this Java bridge? What would
be involved on the OpenMPI side? I have time/interest in going
this route if there would be any hope of coming up with something
that would make my life (and future people coming into HPC/MPI)
easier in terms of building applications. I am not saying MPI in
C/C++/Fortran should go away, however, sometimes we don't need the
low-level stuff to express a concept :-). It may also open a whole
new world for people on large clusters...
Thank you!


--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov



Re: [OMPI users] [External] Books/resources to learn (open)MPI from

2020-08-06 Thread Prentice Bisbal via users
The reason there aren't a lot of "new" books on MPI programming is 
that the standard is pretty stable and the paradigm hasn't really 
changed since the first version of the standard came out in the mid-90s. 
I believe newer versions of the MPI standard have added new features, 
but haven't really changed the original features (I'm sure many people 
on this list will correct me if I'm wrong about that!)


Also, MPI programming is a niche market compared to many other types of 
programming, so there's not a lot of money in making books about MPI 
programming.


I'm not an MPI programmer by profession, but I did take some graduate 
classes on it, and I think this is the best book for learning MPI:


https://www.amazon.com/Using-MPI-Programming-Message-Passing-Engineering/dp/0262527391/

Its writing is simple and to the point - it's very readable. By the end 
of the first couple of chapters, you'll know enough to get started MPI 
programming. If you finish that book and want to learn more of the MPI 
standard, you're in luck, because they made a sequel:


https://www.amazon.com/Using-Advanced-MPI-Message-Passing-Engineering/dp/0262527634/

--
Prentice

On 8/5/20 10:17 PM, Oddo Da via users wrote:
My apologies if this has been asked before, however a Google search on 
books about (open)MPI returns a bunch of material written between 
1990s and early 2000s, it is difficult to find anything "fresh" on 
(open)MPI programming. In comparison, if someone wants to learn about 
Spark and distributed computing - there are tons of courses, videos 
and books online. Has (open)MPI and the whole paradigm not changed at 
all in the last 20 years (are the old books still relevant)? Where do 
people learn about distributed/parallel programming using (open)MPI?


I did find 
https://www.amazon.com/Introduction-Science-Undergraduate-Topics-Computer-dp-3319219022/dp/3319219022/ref=mt_other?_encoding=UTF8== 
(Nielsen's "Introduction to HPC with MPI for Data Science", which was 
published in 2016 but it has no reviews). I also found "Using Advanced 
MPI: Modern Features of the Message-Passing Interface" which was 
published in 2015 but it too has no reviews on Amazon and is upwards 
of $110+ (!).


Any suggestions welcome. Thank you!


--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov



Re: [OMPI users] [External] Correct mpirun Options for Hybrid OpenMPI/OpenMP

2020-08-03 Thread Prentice Bisbal via users
If P=1 and Q=1, you're setting up a 1x1 process grid, which should only need a 
single processor. Something tells me you have 4 independent HPL jobs 
running, rather than one job using 4 threads. I think you should have a 
2x2 grid if you want to use 4 threads. For HPL, P * Q = number of cores 
being used.
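
As a hypothetical illustration, for 4 MPI ranks the relevant HPL.dat lines would look something like this, launched with 'mpirun -np 4 xhpl':

    1            # of process grids (P x Q)
    2            Ps
    2            Qs

For a hybrid run (one rank with OMP_NUM_THREADS=4), keeping P=1 and Q=1 and launching a single rank is the other consistent combination.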


Prentice

On 8/3/20 4:33 AM, John Duffy via users wrote:

Hi

I’m experimenting with hybrid OpenMPI/OpenMP Linpack benchmarks on my 
small cluster, and I’m a bit confused as to how to invoke mpirun.


I have compiled/linked HPL-2.3 with OpenMPI and libopenblas-openmp 
using the GCC -fopenmp option on Ubuntu 20.04 64-bit.


With P=1 and Q=1 in HPL.dat, if I use…

mpirun -x OMP_NUM_THREADS=4 xhpl

top reports...
top - 08:03:59 up 1 day, 0 min,  1 user,  load average: 2.25, 1.23, 0.88
Tasks: 138 total,   2 running, 136 sleeping,   0 stopped,   0 zombie
%Cpu(s): 77.1 us, 22.2 sy,  0.0 ni,  0.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   3793.3 total,    434.0 free,   2814.1 used,   545.2 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   919.9 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM   TIME+ 
COMMAND
   5787 john      20   0 2959408   2.6g   8128 R 354.0  69.1   2:10.43 
xhpl
   5789 john      20   0  263352   9960   7440 S  14.2   0.3   0:07.42 
xhpl
   5788 john      20   0  263352   9844   7320 S  13.9   0.3   0:07.19 
xhpl
   5790 john      20   0  263356   9896   7376 S  13.6   0.3   0:07.17 
xhpl


… which seems reasonable, but I don’t understand why there are 4 xhpl 
processes.



In anticipation of adding more nodes, if I use…

mpirun --host node1 --map-by ppr:1:node -x OMP_NUM_THREADS=4 xhpl

top reports...

top - 07:56:27 up 23:52,  1 user,  load average: 1.00, 0.98, 0.68
Tasks: 133 total,   2 running, 131 sleeping,   0 stopped,   0 zombie
%Cpu(s): 25.1 us,  0.0 sy,  0.0 ni, 74.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   3793.3 total,    454.2 free,   2794.5 used,   544.7 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   939.9 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM   TIME+ 
COMMAND
   5770 john      20   0 2868700   2.5g   7668 R  99.7  68.7   5:20.37 
xhpl


… a single xhpl process (as expected), but with only 25% CPU 
utilisation and no other processes running on the other 3 cores. It 
would appear OpenBLAS is not utilising the 4 cores as expected.



If I then scale it to 2 nodes, with P=1 and Q=2 in HPL.dat...

mpirun --host node1,node2 --map-by ppr:1:node -x OMP_NUM_THREADS=4 xhpl

… similarly, I get a single process on each node, with only 25% CPU 
utilisation.



Any advice/suggestions on how to involve mpirun in a hybrid 
OpenMPI/OpenMP setup would be appreciated.


Kind regards




--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov



Re: [OMPI users] WARNING: There was an error initializing an OpenFabrics device

2020-07-29 Thread Prentice Bisbal via users
Okay, I got this fixed. Apparently, 'make install' wasn't overwriting 
the previous install, so I had to manually delete my previous install 
before doing 'make install'. Once I did that, using UCX 1.8.1 and 
specifying --without-verbs worked.


Prentice

On 7/28/20 2:03 PM, Prentice Bisbal wrote:


Last week I posted on here that I was getting immediate segfaults when 
I ran MPI programs, and the system logs shows that the segfaults were 
occuring in libibverbs.so, and that the problem was still occurring 
even if I specified '-mca btl ^openib'.


Since then, I've made a lot of progress on the problem, and now my 
jobs run, but I'm now getting this error sent to standard error:


WARNING: There was an error initializing an OpenFabrics device.

  Local host:   greene021
  Local device: qib0

For the record, I'm using OpenMPI 4.0.3 running on CentOS 7.8, 
compiled with GCC 9.3.0.


While researching the immediate segfault issue, I came across this Red 
Hat Bug Report:


https://bugzilla.redhat.com/show_bug.cgi?id=1754099

According to that bug report, there was a regression in the version of 
UCX that was provided with CentOS 7.8 (UCX 1.5.2-1.el7), and that 
downgrading to the UCX package that came with CentOS 7.7 (UCX 
1.4.0-1.el7) fixed it. Suspecting this might be the cause of my problem, I did 
the same.


After the downgrade, my jobs still segfaulted, but at least I now got 
a backtrace showing that the segfault was happening in UCX.


Now I suspected a bug in UCX, so I went to the UCX website and 
installed the latest stable version (1.8.1) by building the SRPM 
provided by the UCX website:


https://github.com/openucx/ucx/releases/tag/v1.8.1

After that, my application runs, but I get the error message above 
(repeated here):


WARNING: There was an error initializing an OpenFabrics device.

  Local host:   greene021
  Local device: qib0

Googling for that error message, I came across this OpenMPI bug 
discussion:


https://github.com/open-mpi/ompi/issues/6517

According to this, if I rebuild OpenMPI with the option 
''--without-verbs", that message will go away. I tried that, but I am 
still getting the error message. Here's the configure command-line, 
taken from ompi_info:


Configure command line: 
'--prefix=/usr/pppl/gcc/9.3-pkgs/openmpi-4.0.3' '--with-ucx' 
'--without-verbs' '--with-libfabric' '--with-libevent=/usr' 
'--with-libevent-libdir=/usr/lib64' '--with-pmix=/usr/pppl/pmix/3.1.5' 
'--with-pmi'


I have two questions:

1. How can I be sure that this message is really just a result of the 
old openib code (as stated in the OpenMPI bug discussion above), and 
my job is actually using InfiniBand with UCX?


2. If the message above is harmless, how can I make it go away so my 
users don't see it?


If you've made it this far, thanks for reading my whole message. Any 
help will be greatly appreciated!


--
Prentice


--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov



Re: [OMPI users] WARNING: There was an error initializing an OpenFabrics device

2020-07-28 Thread Prentice Bisbal via users

One more bit of information: These are QLogic IB cards, not Mellanox:

$ lspci | grep QL
05:00.0 InfiniBand: QLogic Corp. IBA7322 QDR InfiniBand HCA (rev 02)

On 7/28/20 2:03 PM, Prentice Bisbal wrote:


Last week I posted on here that I was getting immediate segfaults when 
I ran MPI programs, and the system logs shows that the segfaults were 
occuring in libibverbs.so, and that the problem was still occurring 
even if I specified '-mca btl ^openib'.


Since then, I've made a lot of progress on the problem, and now my 
jobs run, but I'm now getting this error sent to standard error:


WARNING: There was an error initializing an OpenFabrics device.

  Local host:   greene021
  Local device: qib0

For the record, I'm using OpenMPI 4.0.3 running on CentOS 7.8, 
compiled with GCC 9.3.0.


While researching the immediate segfault issue, I came across this Red 
Hat Bug Report:


https://bugzilla.redhat.com/show_bug.cgi?id=1754099

According to that bug report, there was a regression in the version of 
UCX that was provided with CentOS 7.8 (UCX 1.5.2-1.el7), and that 
downgrading to the UCX package that came with CentOS 7.7 (UCX 
1.4.0-1.el7) fixed it. Suspecting this might be the cause of my problem, I did 
the same.


After the downgrade, my jobs still segfaulted, but at least I now got 
a backtrace showing that the segfault was happening in UCX.


Now I suspected a bug in UCX, so I went to the UCX website and 
installed the latest stable version (1.8.1) by building the SRPM 
provided by the UCX website:


https://github.com/openucx/ucx/releases/tag/v1.8.1

After that, my application runs, but I get the error message above 
(repeated here):


WARNING: There was an error initializing an OpenFabrics device.

  Local host:   greene021
  Local device: qib0

Googling for that error message, I came across this OpenMPI bug 
discussion:


https://github.com/open-mpi/ompi/issues/6517

According to this, if I rebuild OpenMPI with the option 
''--without-verbs", that message will go away. I tried that, but I am 
still getting the error message. Here's the configure command-line, 
taken from ompi_info:


Configure command line: 
'--prefix=/usr/pppl/gcc/9.3-pkgs/openmpi-4.0.3' '--with-ucx' 
'--without-verbs' '--with-libfabric' '--with-libevent=/usr' 
'--with-libevent-libdir=/usr/lib64' '--with-pmix=/usr/pppl/pmix/3.1.5' 
'--with-pmi'


I have two questions:

1. How can I be sure that this message is really just a result of the 
old openib code (as stated in the OpenMPI bug discussion above), and 
my job is actually using InfiniBand with UCX?


2. If the message above is harmless, how can I make it go away so my 
users don't see it?


If you've made it this far, thanks for reading my whole message. Any 
help will be greatly appreciated!


--
Prentice


--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov



[OMPI users] WARNING: There was an error initializing an OpenFabrics device

2020-07-28 Thread Prentice Bisbal via users
Last week I posted on here that I was getting immediate segfaults when I 
ran MPI programs, and the system logs shows that the segfaults were 
occuring in libibverbs.so, and that the problem was still occurring even 
if I specified '-mca btl ^openib'.


Since then, I've made a lot of progress on the problem, and now my jobs 
run, but I'm now getting this error sent to standard error:


WARNING: There was an error initializing an OpenFabrics device.

  Local host:   greene021
  Local device: qib0

For the record, I'm using OpenMPI 4.0.3 running on CentOS 7.8, compiled 
with GCC 9.3.0.


While researching the immediate segfault issue, I came across this Red 
Hat Bug Report:


https://bugzilla.redhat.com/show_bug.cgi?id=1754099

According to that bug report, there was a regression in the version of 
UCX that was provided with CentOS 7.8 (UCX 1.5.2-1.el7), and that downgrading 
to the UCX package that came with CentOS 7.7 (UCX 1.4.0-1.el7) fixed it. 
Suspecting this might be the cause of my problem, I did the same.


After the downgrade, my jobs still segfaulted, but at least I now got a 
backtrace showing that the segfault was happening in UCX.


Now I suspected a bug in UCX, so I went to the UCX website and installed 
the latest stable version (1.8.1) by building the SRPM provided by the 
UCX website:


https://github.com/openucx/ucx/releases/tag/v1.8.1

After that, my application runs, but I get the error message above 
(repeated here):


WARNING: There was an error initializing an OpenFabrics device.

  Local host:   greene021
  Local device: qib0

Googling for that error message, I came across this OpenMPI bug discussion:

https://github.com/open-mpi/ompi/issues/6517

According to this, if I rebuild OpenMPI with the option 
''--without-verbs", that message will go away. I tried that, but I am 
still getting the error message. Here's the configure command-line, 
taken from ompi_info:


Configure command line: '--prefix=/usr/pppl/gcc/9.3-pkgs/openmpi-4.0.3' 
'--with-ucx' '--without-verbs' '--with-libfabric' '--with-libevent=/usr' 
'--with-libevent-libdir=/usr/lib64' '--with-pmix=/usr/pppl/pmix/3.1.5' 
'--with-pmi'


I have two questions:

1. How can I be sure that this message is really just a result of the 
old openib code (as stated in the OpenMPI bug discussion above), and my 
job is actually using InfiniBand with UCX?


2. If the message above is harmless, how can I make it go away so my 
users don't see it?


If you've made it this far, thanks for reading my whole message. Any 
help will be greatly appreciated!
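
For question 1, one hedged way to see which transport is actually selected at run time is to turn up the PML selection verbosity and/or ask UCX directly (both are standard knobs; treat this as a sketch):

    mpirun --mca pml_base_verbose 10 -np 2 ./mpihello 2>&1 | grep -i ucx
    ucx_info -d | grep -i transport     # lists the transports UCX detects on the node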


--
Prentice



Re: [OMPI users] [External] Re: segfault in libibverbs.so

2020-07-28 Thread Prentice Bisbal via users
I've been doing a lot of research on this issue (See my next e-mail on 
this topic which I'll be posting ina  few minutes), and OpenMPI will use 
ibverbs or UCX. In OpenMPI 4.0 and later, ibverbs is deprecated in favor 
of UCX.


Prentice

On 7/27/20 7:49 PM, gil...@rist.or.jp wrote:

Prentice,

ibverbs might be used by UCX (either pml/ucx or btl/uct),
so to be 100% sure, you should

mpirun --mca pml ob1 --mca btl ^openib,uct ...

in order to force btl/tcp, you need to ensure pml/ob1 is used,
and then you always need the btl/self component

mpirun --mca pml ob1 --mca btl tcp,self ...

Cheers,

Gilles

- Original Message -

Can anyone explain why my job still calls libibverbs when I run it
with '-mca btl ^openib'?

If I instead use '-mca btl tcp', my jobs don't segfault. I would assume
'-mca btl ^openib' and '-mca btl tcp' to essentially be equivalent, but
there's obviously a difference in the two.

Prentice

On 7/23/20 3:34 PM, Prentice Bisbal wrote:

I manage a cluster that is very heterogeneous. Some nodes have
InfiniBand, while others have 10 Gb/s Ethernet. We recently upgraded
to CentOS 7, and built a new software stack for CentOS 7. We are using
OpenMPI 4.0.3, and we are using Slurm 19.05.5 as our job scheduler.

We just noticed that when jobs are sent to the nodes with IB, they
segfault immediately, with the segfault appearing to come from
libibverbs.so. This is what I see in the stderr output for one of
these failed jobs:

srun: error: greene021: tasks 0-3: Segmentation fault

And here is what I see in the log messages of the compute node where
that segfault happened:

Jul 23 15:19:41 greene021 kernel: mpihello[7911]: segfault at
7f0635f38910 ip 7f0635f49405 sp 7ffe354485a0 error 4
Jul 23 15:19:41 greene021 kernel: mpihello[7912]: segfault at
7f23d51ea910 ip 7f23d51fb405 sp 7ffef250a9a0 error 4
Jul 23 15:19:41 greene021 kernel: in
libibverbs.so.1.5.22.4[7f23d51ec000+18000]
Jul 23 15:19:41 greene021 kernel:
Jul 23 15:19:41 greene021 kernel: mpihello[7909]: segfault at
7ff504ba5910 ip 7ff504bb6405 sp 7917ccb0 error 4
Jul 23 15:19:41 greene021 kernel: in
libibverbs.so.1.5.22.4[7ff504ba7000+18000]
Jul 23 15:19:41 greene021 kernel:
Jul 23 15:19:41 greene021 kernel: mpihello[7910]: segfault at
7fa58abc5910 ip 7fa58abd6405 sp 7ffdde50c0d0 error 4
Jul 23 15:19:41 greene021 kernel: in
libibverbs.so.1.5.22.4[7fa58abc7000+18000]
Jul 23 15:19:41 greene021 kernel:
Jul 23 15:19:41 greene021 kernel: in
libibverbs.so.1.5.22.4[7f0635f3a000+18000]
Jul 23 15:19:41 greene021 kernel

Any idea what is going on here, or how to debug further? I've been
using OpenMPI for years, and it usually just works.

I normally start my job with srun like this:

srun ./mpihello

But even if I try to take IB out of the equation by starting the job
like this:

mpirun -mca btl ^openib ./mpihello

I still get a segfault issue, although the message to stderr is now a
little different:



--

Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


--



--

mpirun noticed that process rank 1 with PID 8502 on node greene021
exited on signal 11 (Segmentation fault).


--


The segfaults happens immediately. It seems to happen as soon as
MPI_Init() is called. The program I'm running is very simple MPI
"Hello world!" program.

The output of  ompi_info is below my signature, in case that helps.

Prentice

$ ompi_info
  Package: Open MPI u...@host.example.com Distribution

     Open MPI: 4.0.3
   Open MPI repo revision: v4.0.3
    Open MPI release date: Mar 03, 2020
     Open RTE: 4.0.3
   Open RTE repo revision: v4.0.3
    Open RTE release date: Mar 03, 2020
     OPAL: 4.0.3
   OPAL repo revision: v4.0.3
    OPAL release date: Mar 03, 2020
  MPI API: 3.1.0
     Ident string: 4.0.3
   Prefix: /usr/pppl/gcc/9.3-pkgs/openmpi-4.0.3
  Configured architecture: x86_64-unknown-linux-gnu
   Configure host: dawson027.pppl.gov
    Configured by: lglant
    Configured on: Mon Jun  1 12:37:07 EDT 2020
   Configure host: dawson027.pppl.gov
   Configure command line: '--prefix=/usr/pppl/gcc/9.3-pkgs/openmpi-4.0.3'
   '--with-ucx' '--with-verbs' '--with-libfabric'
   '--with-libevent=/usr' '--with-libevent-libdir=/usr/lib64'
   '--with-pmix=/usr/pppl/pmix/3.1.5' '--with-pmi'
     Built by: lglant
     Built on: Mon Jun  1 13:05:40 EDT 2020
   Built host: dawson027.pppl.gov
   C bindings: yes
     C++ bindings: no

Re: [OMPI users] segfault in libibverbs.so

2020-07-27 Thread Prentice Bisbal via users
Can anyone explain why my job still calls libibverbs when I run it with 
'-mca btl ^openib'?


If I instead use '-mca btl tcp', my jobs don't segfault. I would assume 
'-mca btl ^openib' and '-mca btl tcp' to essentially be equivalent, but 
there's obviously a difference in the two.
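
One quick sanity check is to see which btl components this build actually has, and to turn up the btl selection verbosity (a sketch; the verbosity level is arbitrary):

    ompi_info | grep btl
    mpirun --mca btl_base_verbose 30 -mca btl ^openib ./mpihello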


Prentice

On 7/23/20 3:34 PM, Prentice Bisbal wrote:
I manage a cluster that is very heterogeneous. Some nodes have 
InfiniBand, while others have 10 Gb/s Ethernet. We recently upgraded 
to CentOS 7, and built a new software stack for CentOS 7. We are using 
OpenMPI 4.0.3, and we are using Slurm 19.05.5 as our job scheduler.


We just noticed that when jobs are sent to the nodes with IB, they 
segfault immediately, with the segfault appearing to come from 
libibverbs.so. This is what I see in the stderr output for one of 
these failed jobs:


srun: error: greene021: tasks 0-3: Segmentation fault

And here is what I see in the log messages of the compute node where 
that segfault happened:


Jul 23 15:19:41 greene021 kernel: mpihello[7911]: segfault at 
7f0635f38910 ip 7f0635f49405 sp 7ffe354485a0 error 4
Jul 23 15:19:41 greene021 kernel: mpihello[7912]: segfault at 
7f23d51ea910 ip 7f23d51fb405 sp 7ffef250a9a0 error 4
Jul 23 15:19:41 greene021 kernel: in 
libibverbs.so.1.5.22.4[7f23d51ec000+18000]

Jul 23 15:19:41 greene021 kernel:
Jul 23 15:19:41 greene021 kernel: mpihello[7909]: segfault at 
7ff504ba5910 ip 7ff504bb6405 sp 7917ccb0 error 4
Jul 23 15:19:41 greene021 kernel: in 
libibverbs.so.1.5.22.4[7ff504ba7000+18000]

Jul 23 15:19:41 greene021 kernel:
Jul 23 15:19:41 greene021 kernel: mpihello[7910]: segfault at 
7fa58abc5910 ip 7fa58abd6405 sp 7ffdde50c0d0 error 4
Jul 23 15:19:41 greene021 kernel: in 
libibverbs.so.1.5.22.4[7fa58abc7000+18000]

Jul 23 15:19:41 greene021 kernel:
Jul 23 15:19:41 greene021 kernel: in 
libibverbs.so.1.5.22.4[7f0635f3a000+18000]

Jul 23 15:19:41 greene021 kernel

Any idea what is going on here, or how to debug further? I've been 
using OpenMPI for years, and it usually just works.


I normally start my job with srun like this:

srun ./mpihello

But even if I try to take IB out of the equation by starting the job 
like this:


mpirun -mca btl ^openib ./mpihello

I still get a segfault issue, although the message to stderr is now a 
little different:


-- 


Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-- 

-- 

mpirun noticed that process rank 1 with PID 8502 on node greene021 
exited on signal 11 (Segmentation fault).
-- 



The segfaults happens immediately. It seems to happen as soon as 
MPI_Init() is called. The program I'm running is very simple MPI 
"Hello world!" program.


The output of  ompi_info is below my signature, in case that helps.

Prentice

$ ompi_info
 Package: Open MPI u...@host.example.com Distribution
    Open MPI: 4.0.3
  Open MPI repo revision: v4.0.3
   Open MPI release date: Mar 03, 2020
    Open RTE: 4.0.3
  Open RTE repo revision: v4.0.3
   Open RTE release date: Mar 03, 2020
    OPAL: 4.0.3
  OPAL repo revision: v4.0.3
   OPAL release date: Mar 03, 2020
 MPI API: 3.1.0
    Ident string: 4.0.3
  Prefix: /usr/pppl/gcc/9.3-pkgs/openmpi-4.0.3
 Configured architecture: x86_64-unknown-linux-gnu
  Configure host: dawson027.pppl.gov
   Configured by: lglant
   Configured on: Mon Jun  1 12:37:07 EDT 2020
  Configure host: dawson027.pppl.gov
  Configure command line: '--prefix=/usr/pppl/gcc/9.3-pkgs/openmpi-4.0.3'
  '--with-ucx' '--with-verbs' '--with-libfabric'
  '--with-libevent=/usr'
'--with-libevent-libdir=/usr/lib64'
'--with-pmix=/usr/pppl/pmix/3.1.5' '--with-pmi'
    Built by: lglant
    Built on: Mon Jun  1 13:05:40 EDT 2020
  Built host: dawson027.pppl.gov
  C bindings: yes
    C++ bindings: no
 Fort mpif.h: yes (all)
    Fort use mpi: yes (full: ignore TKR)
   Fort use mpi size: deprecated-ompi-info-value
    Fort use mpi_f08: yes
 Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
  limitations in the gfortran compiler and/or Open
  MPI, does not support the following: array
  subsections, direct passthru (where possible) to
  underlying Open MPI's C functionality
  Fort mpi_f08 subarrays: no
   Java bindings: no
  Wrapper compiler rpath: runpath
  C compiler: gcc

[OMPI users] segfault in libibverbs.so

2020-07-23 Thread Prentice Bisbal via users
I manage a cluster that is very heterogeneous. Some nodes have 
InfiniBand, while others have 10 Gb/s Ethernet. We recently upgraded to 
CentOS 7, and built a new software stack for CentOS 7. We are using 
OpenMPI 4.0.3, and we are using Slurm 19.05.5 as our job scheduler.


We just noticed that when jobs are sent to the nodes with IB, they 
segfault immediately, with the segfault appearing to come from 
libibverbs.so. This is what I see in the stderr output for one of these 
failed jobs:


srun: error: greene021: tasks 0-3: Segmentation fault

And here is what I see in the log messages of the compute node where 
that segfault happened:


Jul 23 15:19:41 greene021 kernel: mpihello[7911]: segfault at 
7f0635f38910 ip 7f0635f49405 sp 7ffe354485a0 error 4
Jul 23 15:19:41 greene021 kernel: mpihello[7912]: segfault at 
7f23d51ea910 ip 7f23d51fb405 sp 7ffef250a9a0 error 4
Jul 23 15:19:41 greene021 kernel: in 
libibverbs.so.1.5.22.4[7f23d51ec000+18000]

Jul 23 15:19:41 greene021 kernel:
Jul 23 15:19:41 greene021 kernel: mpihello[7909]: segfault at 
7ff504ba5910 ip 7ff504bb6405 sp 7917ccb0 error 4
Jul 23 15:19:41 greene021 kernel: in 
libibverbs.so.1.5.22.4[7ff504ba7000+18000]

Jul 23 15:19:41 greene021 kernel:
Jul 23 15:19:41 greene021 kernel: mpihello[7910]: segfault at 
7fa58abc5910 ip 7fa58abd6405 sp 7ffdde50c0d0 error 4
Jul 23 15:19:41 greene021 kernel: in 
libibverbs.so.1.5.22.4[7fa58abc7000+18000]

Jul 23 15:19:41 greene021 kernel:
Jul 23 15:19:41 greene021 kernel: in 
libibverbs.so.1.5.22.4[7f0635f3a000+18000]

Jul 23 15:19:41 greene021 kernel

Any idea what is going on here, or how to debug further? I've been using 
OpenMPI for years, and it usually just works.


I normally start my job with srun like this:

srun ./mpihello

But even if I try to take IB out of the equation by starting the job 
like this:


mpirun -mca btl ^openib ./mpihello

I still get a segfault issue, although the message to stderr is now a 
little different:


--
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--
--
mpirun noticed that process rank 1 with PID 8502 on node greene021 
exited on signal 11 (Segmentation fault).

--

The segfaults happens immediately. It seems to happen as soon as 
MPI_Init() is called. The program I'm running is very simple MPI "Hello 
world!" program.


The output of  ompi_info is below my signature, in case that helps.

Prentice

$ ompi_info
 Package: Open MPI u...@host.example.com Distribution
    Open MPI: 4.0.3
  Open MPI repo revision: v4.0.3
   Open MPI release date: Mar 03, 2020
    Open RTE: 4.0.3
  Open RTE repo revision: v4.0.3
   Open RTE release date: Mar 03, 2020
    OPAL: 4.0.3
  OPAL repo revision: v4.0.3
   OPAL release date: Mar 03, 2020
 MPI API: 3.1.0
    Ident string: 4.0.3
  Prefix: /usr/pppl/gcc/9.3-pkgs/openmpi-4.0.3
 Configured architecture: x86_64-unknown-linux-gnu
  Configure host: dawson027.pppl.gov
   Configured by: lglant
   Configured on: Mon Jun  1 12:37:07 EDT 2020
  Configure host: dawson027.pppl.gov
  Configure command line: '--prefix=/usr/pppl/gcc/9.3-pkgs/openmpi-4.0.3'
  '--with-ucx' '--with-verbs' '--with-libfabric'
  '--with-libevent=/usr'
'--with-libevent-libdir=/usr/lib64'
'--with-pmix=/usr/pppl/pmix/3.1.5' '--with-pmi'
    Built by: lglant
    Built on: Mon Jun  1 13:05:40 EDT 2020
  Built host: dawson027.pppl.gov
  C bindings: yes
    C++ bindings: no
 Fort mpif.h: yes (all)
    Fort use mpi: yes (full: ignore TKR)
   Fort use mpi size: deprecated-ompi-info-value
    Fort use mpi_f08: yes
 Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
  limitations in the gfortran compiler and/or Open
  MPI, does not support the following: array
  subsections, direct passthru (where possible) to
  underlying Open MPI's C functionality
  Fort mpi_f08 subarrays: no
   Java bindings: no
  Wrapper compiler rpath: runpath
  C compiler: gcc
 C compiler absolute: /usr/pppl/gcc/9.3.0/bin/gcc
  C compiler family name: GNU
  C compiler version: 9.3.0
    C++ compiler: g++
   C++ compiler absolute: /usr/pppl/gcc/9.3.0/bin/g++
   Fort compiler: gfortran
   Fort compiler abs: /usr/pppl/gcc/9.3.0/bin/gfortran
 Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)

Re: [OMPI users] [External] Re: choosing network: infiniband vs. ethernet

2020-07-20 Thread Prentice Bisbal via users

Jeff,

Then you'll be happy to know I've been building OpenMPI for years and I 
never had any complaints about your configure/build system. Of course, 
I'm a pro who gets paid to build open-source software all day long, but 
I have to say I've never had any issues with configure, make, or 'make 
check' with any version of OpenMPI.


Keep up the great work!

Prentice

On 7/18/20 9:36 AM, Jeff Squyres (jsquyres) via users wrote:
Woo hoo!  I love getting emails like this.  We actually spend quite a 
bit of time in the design and implementation of the configure/build 
system so that it will "just work" in a wide variety of situations.


Thanks!


On Jul 17, 2020, at 5:43 PM, John Duffy via users 
<users@lists.open-mpi.org> wrote:


Hi Lana

I'm an Open MPI newbie too, but I managed to build Open MPI 4.0.4 
quite easily on Ubuntu 20.04 just by following the instructions in 
README/INSTALL in the top level source directory, namely:


mkdir build
cd build
../configure CFLAGS="-O3"  # My CFLAGS
make all
sudo make all
sudo make install

It just worked. My small cluster happily runs Open MPI over TCP/1 Gb/s 
Ethernet.


The make install step installed everything into /usr/local. I did 
forget to run ldconfig initially, which confused me. Other than that it 
just worked.
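For anyone who hits the same thing, the missing step is roughly this (a sketch assuming the default /usr/local prefix; the conf file name below is just an example):

sudo ldconfig                                                  # often enough, since /usr/local/lib is usually already configured
echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/local.conf    # otherwise add the library directory first
sudo ldconfig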


John




--
Jeff Squyres
jsquy...@cisco.com


--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov



Re: [OMPI users] [External] Re: Signal code: Non-existant physical address (2)

2020-07-07 Thread Prentice Bisbal via users

Jeff,

Thanks for the detailed explanation. After some googling, it seemed like 
this might be a bug in 1.10.3 that only reveals itself on certain 
hardware. Since my user isn't interested in using a newer OpenMPI (but 
he will be forced to soon enough when we upgrade our cluster!), he has 
been using Slurm's exclude feature to exclude those problem nodes.


The good news is that in the fall we will have a new, homogeneous 
cluster with all new hardware.


Prentice

On 7/6/20 7:47 AM, Jeff Squyres (jsquyres) wrote:

Greetings Prentice.

This is a very generic error; it's basically just indicating "somewhere in the 
program, we got a bad pointer address."

It's very difficult to know if this issue is in Open MPI or in the application 
itself (e.g., memory corruption by the application eventually leads to bad data 
being used as a pointer, and then... kaboom).

You *may* be able to upgrade to at least the latest version of the 1.10 series: 
1.10.7.  It should be ABI compatible with 1.10.3; if the user's application is 
dynamically linked against 1.10.3, you might just be able to change 
LD_LIBRARY_PATH and point to a 1.10.7 installation.  In this way, if the bus 
error was caused by Open MPI itself, upgrading to v1.10.7 may fix it.
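A sketch of that LD_LIBRARY_PATH swap (the 1.10.7 install path here is purely illustrative), which only helps if the executable is dynamically linked against libmpi:

export LD_LIBRARY_PATH=/path/to/openmpi-1.10.7/lib:$LD_LIBRARY_PATH
ldd ./app | grep libmpi     # confirm the executable now resolves against the 1.10.7 libraries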

Other than that, based on the situation you're describing, if the problem only consistently happens 
on nodes of a specific type in your cluster, it could also be that the application was compiled on 
a machine that has a newer architecture than the "problem" nodes in your cluster.  As 
such, the compiler/assembler may have included instructions in the Open MPI library and/or 
executable that simply do not exist on the "problem" nodes.  When those instructions are 
(attempted to be) executed on the older/problem nodes... kaboom.

This is admittedly unlikely; I would expect to see a different kind of error message in these kinds 
of situations, but given the nature of your heterogeneous cluster, such things are definitely 
possible (e.g., an invalid instruction causes a failure on the MPI processes on the 
"problem" nodes, causing them to abort, but before Open MPI can kill all surviving 
processes, other MPI processes end up in error states because of the unexpected failure from the 
"problem" node processes, and at least one of them results in a bus error).

The rule of thumb for jobs that span heterogeneous nodes in a cluster is to 
compile/link everything on the oldest node to make sure that the 
compiler/linker don't put in instructions that won't work on old machines.  You 
can compile on newer nodes and use specific compiler/linker flags to restrict 
generated instructions, too, but it can be difficult to track down the precise 
flags that you need.
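For example (a generic GCC sketch, not something from this thread), pinning code generation to the plain x86-64 baseline when building on a newer node keeps the binaries runnable on the oldest nodes:

mpicc -O2 -march=x86-64 -mtune=generic -o app app.c            # the application
./configure CC=gcc CFLAGS="-O2 -march=x86-64 -mtune=generic"   # Open MPI itself, with whatever other options are needed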




On Jul 2, 2020, at 10:22 AM, Prentice Bisbal via users 
 wrote:

I manage a very heterogeneous cluster. I have nodes of different ages with 
different processors, different amounts of RAM, etc. One user is reporting that 
on certain nodes, his jobs keep crashing with the errors below. His application 
is using OpenMPI 1.10.3, which I know is an ancient version of OpenMPI, but 
someone else in his research group built the code with that, so that's what 
he's stuck with.

I did a Google search of "Signal code: Non-existant physical address", and it 
appears that this may be a bug in 1.10.3 that happens on certain hardware. Can anyone 
else confirm this? The full error message is below:

[dawson120:29064] *** Process received signal ***
[dawson120:29062] *** Process received signal ***
[dawson120:29062] Signal: Bus error (7)
[dawson120:29062] Signal code: Non-existant physical address (2)
[dawson120:29062] Failing at address: 0x7ff3f030f180
[dawson120:29067] *** Process received signal ***
[dawson120:29067] Signal: Bus error (7)
[dawson120:29067] Signal code: Non-existant physical address (2)
[dawson120:29067] Failing at address: 0x7fb2b8a61d18
[dawson120:29077] *** Process received signal ***
[dawson120:29078] *** Process received signal ***
[dawson120:29078] Signal: Bus error (7)
[dawson120:29078] Signal code: Non-existant physical address (2)
[dawson120:29078] Failing at address: 0x7f60a13d2c98
[dawson120:29078] [ 0] /lib64/libpthread.so.0(+0xf7e0)[0x7f60b7efd7e0]
[dawson120:29078] [ 1] 
/usr/pppl/intel/2015-pkgs/openmpi-1.10.3/lib/openmpi/mca_allocator_bucket.so(mca_allocator_bucket_alloc_align+0x84)[0x7f60b20f6ea4]

I've asked the user to switch to a newer version of OpenMPI, but since his research group 
is all using the same application and someone else built it, he's not in a position to do 
that. For now, he's excluding the "bad" nodes with Slurm -x option.

I just want to know if this is in fact a bug in 1.10.3, or if there's something 
we can do to fix this error.

Thanks,

--
Prentice




--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov



[OMPI users] Signal code: Non-existant physical address (2)

2020-07-02 Thread Prentice Bisbal via users
I manage a very heterogeneous cluster. I have nodes of different ages 
with different processors, different amounts of RAM, etc. One user is 
reporting that on certain nodes, his jobs keep crashing with the errors 
below. His application is using OpenMPI 1.10.3, which I know is an 
ancient version of OpenMPI, but someone else in his research group built 
the code with that, so that's what he's stuck with.


I did a Google search of "Signal code: Non-existant physical address", 
and it appears that this may be a bug in 1.10.3 that happens on certain 
hardware. Can anyone else confirm this? The full error message is below:


[dawson120:29064] *** Process received signal ***
[dawson120:29062] *** Process received signal ***
[dawson120:29062] Signal: Bus error (7)
[dawson120:29062] Signal code: Non-existant physical address (2)
[dawson120:29062] Failing at address: 0x7ff3f030f180
[dawson120:29067] *** Process received signal ***
[dawson120:29067] Signal: Bus error (7)
[dawson120:29067] Signal code: Non-existant physical address (2)
[dawson120:29067] Failing at address: 0x7fb2b8a61d18
[dawson120:29077] *** Process received signal ***
[dawson120:29078] *** Process received signal ***
[dawson120:29078] Signal: Bus error (7)
[dawson120:29078] Signal code: Non-existant physical address (2)
[dawson120:29078] Failing at address: 0x7f60a13d2c98
[dawson120:29078] [ 0] /lib64/libpthread.so.0(+0xf7e0)[0x7f60b7efd7e0]
[dawson120:29078] [ 1] 
/usr/pppl/intel/2015-pkgs/openmpi-1.10.3/lib/openmpi/mca_allocator_bucket.so(mca_allocator_bucket_alloc_align+0x84)[0x7f60b20f6ea4]


I've asked the user to switch to a newer version of OpenMPI, but since 
his research group is all using the same application and someone else 
built it, he's not in a position to do that. For now, he's excluding the 
"bad" nodes with Slurm -x option.


I just want to know if this is in fact a bug in 1.10.3, or if there's 
something we can do to fix this error.


Thanks,

--
Prentice



Re: [OMPI users] [External] Re: can't open /dev/ipath, network down (err=26)

2020-05-11 Thread Prentice Bisbal via users

Thanks. I'm going to give this solution a try.

On 5/9/20 9:51 AM, Patrick Bégou via users wrote:

Le 08/05/2020 à 21:56, Prentice Bisbal via users a écrit :


We often get the following errors when more than one job runs on the 
same compute node. We are using Slurm with OpenMPI. The IB cards are 
QLogic using PSM:


10698ipath_userinit: assign_context command failed: Network is down
node01.10698can't open /dev/ipath, network down (err=26)
node01.10703ipath_userinit: assign_context command failed: Network is 
down

node01.10703can't open /dev/ipath, network down (err=26)
node01.10701ipath_userinit: assign_context command failed: Network is 
down

node01.10701can't open /dev/ipath, network down (err=26)
node01.10700ipath_userinit: assign_context command failed: Network is 
down

node01.10700can't open /dev/ipath, network down (err=26)
node01.10697ipath_userinit: assign_context command failed: Network is 
down

node01.10697can't open /dev/ipath, network down (err=26)
--
PSM was unable to open an endpoint. Please make sure that the network 
link is

active on the node and the hardware is functioning.

Error: Could not detect network connectivity
--

Any Ideas how to fix this?

--
Prentice



Hi Prentice,

This is not openMPI related but merely due to your hardware. I've not 
many details but I think this occurs when several jobs share the same 
node and you have a large number of cores on these nodes (> 14). If 
this is the case:


On QLogic (I'm using such hardware at this time) you have 16 channels 
for communication on each HBA and, if I remember what I read many 
years ago, 2 are dedicated to the system. When launching MPI 
applications, each process of a job requests its own dedicated 
channel if one is available; otherwise they share ALL the available 
channels. So if a second job starts on the same node, no channels 
remain available.


To avoid this situation I force each channel to be shared by 2 MPI 
processes (my nodes have 20 cores, and 2 ranks per context across the 
~14 user-available contexts is enough to cover a full node). You can set 
this with a simple environment variable. On all my cluster nodes I 
create the file:


/etc/profile.d/ibsetcontext.sh

And it contains:

# allow 2 processes to share a hardware MPI context
# in InfiniBand with PSM
export PSM_RANKS_PER_CONTEXT=2

Of course if some people manage to oversubscribe the cores (more 
than one process per core) the problem could arise again, but we do 
not oversubscribe.


Hope this can help you.

Patrick


--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov



Re: [OMPI users] [External] can't open /dev/ipath, network down (err=26)

2020-05-11 Thread Prentice Bisbal via users

I believe they're DDR cards.

On 5/9/20 6:36 AM, Heinz, Michael William via users wrote:

Prentice,

Avoiding the obvious question of whether your FM is running and the fabric is 
in an active state, it sounds like you're exhausting a resource on the cards. 
Ralph is correct about support for QLogic cards being long past, but I'll see 
what I can dig up in the archives on Monday to see if there's a parameter you 
can adjust.

My vague recollection is that you shouldn't try to have more compute processes 
than you have cores, as some resources are allocated on that basis. You might 
also look at the modinfo output for the device driver to see if there are any 
likely looking suspects.

Honestly, chances are better that you’ll get a hint from modinfo than that I’ll 
find a tuning guide laying around. Are these cards DDR or QDR?

Sent from my iPad


--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov



[OMPI users] can't open /dev/ipath, network down (err=26)

2020-05-08 Thread Prentice Bisbal via users
We often get the following errors when more than one job runs on the 
same compute node. We are using Slurm with OpenMPI. The IB cards are 
QLogic using PSM:


10698ipath_userinit: assign_context command failed: Network is down
node01.10698can't open /dev/ipath, network down (err=26)
node01.10703ipath_userinit: assign_context command failed: Network is down
node01.10703can't open /dev/ipath, network down (err=26)
node01.10701ipath_userinit: assign_context command failed: Network is down
node01.10701can't open /dev/ipath, network down (err=26)
node01.10700ipath_userinit: assign_context command failed: Network is down
node01.10700can't open /dev/ipath, network down (err=26)
node01.10697ipath_userinit: assign_context command failed: Network is down
node01.10697can't open /dev/ipath, network down (err=26)
--
PSM was unable to open an endpoint. Please make sure that the network 
link is

active on the node and the hardware is functioning.

Error: Could not detect network connectivity
--

Any Ideas how to fix this?

--
Prentice



Re: [OMPI users] [External] Re: Can't start jobs with srun.

2020-05-06 Thread Prentice Bisbal via users
No, and I fear that may be the problem. When we built OpenMPI, we did 
--with-pmix=internal. Not sure how Slurm was built, since my coworker 
built it.



Prentice


On 4/28/20 2:07 AM, Daniel Letai via users wrote:


I know it's not supposed to matter, but have you tried building both 
ompi and slurm against the same pmix? That is: first build pmix, then 
build slurm with pmix, and then ompi with both slurm and pmix=external?
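Roughly like this (a sketch with illustrative prefixes, reusing the PMIx 3.1.5 already installed at the site):

# Slurm built against the external PMIx (prefix paths here are illustrative)
./configure --prefix=/usr/local/slurm --with-pmix=/usr/pppl/pmix/3.1.5

# Open MPI built against the same PMIx (and Slurm) instead of its internal copy
./configure --prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3 \
    --with-slurm --with-psm --with-pmix=/usr/pppl/pmix/3.1.5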




On 23/04/2020 17:00, Prentice Bisbal via users wrote:


$ ompi_info | grep slurm
  Configure command line: 
'--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3' 
'--disable-silent-rules' '--enable-shared' '--with-pmix=internal' 
'--with-slurm' '--with-psm'
 MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component 
v4.0.3)
 MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component 
v4.0.3)
 MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component 
v4.0.3)
  MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component 
v4.0.3)


Any ideas what could be wrong? Do you need any additional information?

Prentice


--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov



Re: [OMPI users] [External] RE: Re: Can't start jobs with srun.

2020-05-06 Thread Prentice Bisbal via users
Thanks for the suggestion. We are using an NFSRoot OS image on all the 
nodes, so all the nodes have to be running the same version of OMPI.


On 4/27/20 10:58 AM, Riebs, Andy wrote:


Y’know, a quick check on versions and PATHs might be a good idea here. 
I suggest something like


$ srun  -N3  ompi_info  |&  grep  "MPI repo"

to confirm that all nodes are running the same version of OMPI.

*From:*users [mailto:users-boun...@lists.open-mpi.org] *On Behalf Of 
*Prentice Bisbal via users

*Sent:* Monday, April 27, 2020 10:25 AM
*To:* users@lists.open-mpi.org
*Cc:* Prentice Bisbal 
*Subject:* Re: [OMPI users] [External] Re: Can't start jobs with srun.

Ralph,

PMI2 support works just fine. It's just PMIx that seems to be the 
problem.


We rebuilt Slurm with PMIx 3.1.5, but the problem persists. I've 
opened a ticket with Slurm support to see if it's a problem on Slurm's 
end.


Prentice

On 4/26/20 2:12 PM, Ralph Castain via users wrote:

It is entirely possible that the PMI2 support in OMPI v4 is broken
- I doubt it is used or tested very much as pretty much everyone
has moved to PMIx. In fact, we completely dropped PMI-1 and PMI-2
from OMPI v5 for that reason.

I would suggest building Slurm with PMIx v3.1.5
(https://github.com/openpmix/openpmix/releases/tag/v3.1.5) as that
is what OMPI v4 is using, and launching with "srun --mpi=pmix_v3"



On Apr 26, 2020, at 10:07 AM, Patrick Bégou via users
mailto:users@lists.open-mpi.org>>
wrote:

I also have this problem on servers I'm benching at DELL's lab with
OpenMPI-4.0.3. I've tried a new build of OpenMPI with "--with-pmi2". No
change.
Finally my workaround in the slurm script was to launch my code with
mpirun. As mpirun was only finding one slot per node I have used
"--oversubscribe --bind-to core" and checked that every process was
bound to a separate core. It worked but do not ask me why :-)

Patrick

Le 24/04/2020 à 20:28, Riebs, Andy via users a écrit :

Prentice, have you tried something trivial, like "srun -N3
hostname", to rule out non-OMPI problems?

Andy

-Original Message-
From: users [mailto:users-boun...@lists.open-mpi.org] On
Behalf Of Prentice Bisbal via users
Sent: Friday, April 24, 2020 2:19 PM
To: Ralph Castain mailto:r...@open-mpi.org>>; Open MPI Users
    mailto:users@lists.open-mpi.org>>
Cc: Prentice Bisbal mailto:pbis...@pppl.gov>>
Subject: Re: [OMPI users] [External] Re: Can't start jobs
with srun.

Okay. I've got Slurm built with pmix support:

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmix_v3
srun: pmi2
srun: openmpi
srun: pmix

But now when I try to launch a job with srun, the job
appears to be
running, but doesn't do anything - it just hangs in the
running state
but doesn't do anything. Any ideas what could be wrong, or
how to debug
this?

I'm also asking around on the Slurm mailing list, too

Prentice

On 4/23/20 3:03 PM, Ralph Castain wrote:

You can trust the --mpi=list. The problem is likely
that OMPI wasn't configured --with-pmi2



On Apr 23, 2020, at 11:59 AM, Prentice Bisbal via
users mailto:users@lists.open-mpi.org>> wrote:

--mpi=list shows pmi2 and openmpi as valid values,
but if I set --mpi= to either of them, my job
still fails. Why is that? Can I not trust the
output of --mpi=list?

Prentice

On 4/23/20 10:43 AM, Ralph Castain via users wrote:

No, but you do have to explicitly build OMPI
with non-PMIx support if that is what you are
going to use. In this case, you need to
configure OMPI
--with-pmi2=

You can leave off the path if Slurm (i.e.,
just "--with-pmi2") was installed in a
standard location as we should find it there.



    On Apr 23, 2020, at 7:39 AM, Prentice
Bisbal via users mailto:users@lists.open-mpi.org>> wrote:

It looks like it was built with PMI2, but
not PMIx:

$ srun --mpi=list
srun: MPI types are...
   

Re: [OMPI users] [External] Re: Can't start jobs with srun.

2020-04-27 Thread Prentice Bisbal via users

Ralph,

PMI2 support works just fine. It's just PMIx that seems to be the problem.

We rebuilt Slurm with PMIx 3.1.5, but the problem persists. I've opened 
a ticket with Slurm support to see if it's a problem on Slurm's end.


Prentice

On 4/26/20 2:12 PM, Ralph Castain via users wrote:
It is entirely possible that the PMI2 support in OMPI v4 is broken - I 
doubt it is used or tested very much as pretty much everyone has moved 
to PMIx. In fact, we completely dropped PMI-1 and PMI-2 from OMPI v5 
for that reason.


I would suggest building Slurm with PMIx v3.1.5 
(https://github.com/openpmix/openpmix/releases/tag/v3.1.5) as that is 
what OMPI v4 is using, and launching with "srun --mpi=pmix_v3"



On Apr 26, 2020, at 10:07 AM, Patrick Bégou via users 
mailto:users@lists.open-mpi.org>> wrote:


I also have this problem on servers I'm benching at DELL's lab with
OpenMPI-4.0.3. I've tried a new build of OpenMPI with "--with-pmi2". No
change.
Finally my workaround in the slurm script was to launch my code with
mpirun. As mpirun was only finding one slot per node I have used
"--oversubscribe --bind-to core" and checked that every process was
bound to a separate core. It worked but do not ask me why :-)

Patrick

Le 24/04/2020 à 20:28, Riebs, Andy via users a écrit :
Prentice, have you tried something trivial, like "srun -N3 
hostname", to rule out non-OMPI problems?


Andy

-Original Message-
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of 
Prentice Bisbal via users

Sent: Friday, April 24, 2020 2:19 PM
To: Ralph Castain mailto:r...@open-mpi.org>>; Open 
MPI Users mailto:users@lists.open-mpi.org>>

Cc: Prentice Bisbal mailto:pbis...@pppl.gov>>
Subject: Re: [OMPI users] [External] Re: Can't start jobs with srun.

Okay. I've got Slurm built with pmix support:

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmix_v3
srun: pmi2
srun: openmpi
srun: pmix

But now when I try to launch a job with srun, the job appears to be
running, but doesn't do anything - it just hangs in the running state
but doesn't do anything. Any ideas what could be wrong, or how to debug
this?

I'm also asking around on the Slurm mailing list, too

Prentice

On 4/23/20 3:03 PM, Ralph Castain wrote:
You can trust the --mpi=list. The problem is likely that OMPI 
wasn't configured --with-pmi2



On Apr 23, 2020, at 11:59 AM, Prentice Bisbal via users 
mailto:users@lists.open-mpi.org>> wrote:


--mpi=list shows pmi2 and openmpi as valid values, but if I set 
--mpi= to either of them, my job still fails. Why is that? Can I 
not trust the output of --mpi=list?


Prentice

On 4/23/20 10:43 AM, Ralph Castain via users wrote:
No, but you do have to explicitly build OMPI with non-PMIx 
support if that is what you are going to use. In this case, you 
need to configure OMPI --with-pmi2=


You can leave off the path if Slurm (i.e., just "--with-pmi2") 
was installed in a standard location as we should find it there.



On Apr 23, 2020, at 7:39 AM, Prentice Bisbal via users 
mailto:users@lists.open-mpi.org>> wrote:


It looks like it was built with PMI2, but not PMIx:

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmi2
srun: openmpi

I did launch the job with srun --mpi=pmi2 

Does OpenMPI 4 need PMIx specifically?


On 4/23/20 10:23 AM, Ralph Castain via users wrote:

Is Slurm built with PMIx support? Did you tell srun to use it?


On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users 
mailto:users@lists.open-mpi.org>> 
wrote:


I'm using OpenMPI 4.0.3 with Slurm 19.05.5  I'm testing the 
software with a very simple hello, world MPI program that I've 
used reliably for years. When I submit the job through slurm 
and use srun to launch the job, I get these errors:


*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will 
now abort,

***    and potentially your MPI job)
[dawson029.pppl.gov:26070] 
Local abort before MPI_INIT completed completed successfully, 
but am not able to aggregate error messages, and not able to 
guarantee that all other processes were killed!

*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will 
now abort,

***    and potentially your MPI job)
[dawson029.pppl.gov:26076] 
Local abort before MPI_INIT completed completed successfully, 
but am not able to aggregate error messages, and not able to 
guarantee that all other processes were killed!


If I run the same job, but use mpiexec or mpirun instead of 
srun, the jobs run just fine. I checked ompi_info to make sure 
OpenMPI was compiled with  Slurm support:


$ ompi_info | grep slurm
  Configure command line: 
'--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3' 
'--disable-silent-rules' '--enable-shared' 
'--with-pmix=i

Re: [OMPI users] [External] RE: Re: Can't start jobs with srun.

2020-04-27 Thread Prentice Bisbal via users
Yes. "srun -N3 hostname" works. The problem only seems to occur when I 
specify the --mpi option, so the problem seems related to PMI.


On 4/24/20 2:28 PM, Riebs, Andy wrote:

Prentice, have you tried something trivial, like "srun -N3 hostname", to rule 
out non-OMPI problems?

Andy

-Original Message-
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Prentice 
Bisbal via users
Sent: Friday, April 24, 2020 2:19 PM
To: Ralph Castain ; Open MPI Users 
Cc: Prentice Bisbal 
Subject: Re: [OMPI users] [External] Re: Can't start jobs with srun.

Okay. I've got Slurm built with pmix support:

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmix_v3
srun: pmi2
srun: openmpi
srun: pmix

But now when I try to launch a job with srun, the job appears to be
running, but doesn't do anything - it just hangs in the running state
but doesn't do anything. Any ideas what could be wrong, or how to debug
this?

I'm also asking around on the Slurm mailing list, too

Prentice

On 4/23/20 3:03 PM, Ralph Castain wrote:

You can trust the --mpi=list. The problem is likely that OMPI wasn't configured 
--with-pmi2



On Apr 23, 2020, at 11:59 AM, Prentice Bisbal via users 
 wrote:

--mpi=list shows pmi2 and openmpi as valid values, but if I set --mpi= to 
either of them, my job still fails. Why is that? Can I not trust the output of 
--mpi=list?

Prentice

On 4/23/20 10:43 AM, Ralph Castain via users wrote:

No, but you do have to explicitly build OMPI with non-PMIx support if that is what 
you are going to use. In this case, you need to configure OMPI 
--with-pmi2=

You can leave off the path if Slurm (i.e., just "--with-pmi2") was installed in 
a standard location as we should find it there.



On Apr 23, 2020, at 7:39 AM, Prentice Bisbal via users 
 wrote:

It looks like it was built with PMI2, but not PMIx:

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmi2
srun: openmpi

I did launch the job with srun --mpi=pmi2 

Does OpenMPI 4 need PMIx specifically?


On 4/23/20 10:23 AM, Ralph Castain via users wrote:

Is Slurm built with PMIx support? Did you tell srun to use it?



On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users 
 wrote:

I'm using OpenMPI 4.0.3 with Slurm 19.05.5  I'm testing the software with a 
very simple hello, world MPI program that I've used reliably for years. When I 
submit the job through slurm and use srun to launch the job, I get these errors:

*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***and potentially your MPI job)
[dawson029.pppl.gov:26070] Local abort before MPI_INIT completed completed 
successfully, but am not able to aggregate error messages, and not able to 
guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***and potentially your MPI job)
[dawson029.pppl.gov:26076] Local abort before MPI_INIT completed completed 
successfully, but am not able to aggregate error messages, and not able to 
guarantee that all other processes were killed!

If I run the same job, but use mpiexec or mpirun instead of srun, the jobs run 
just fine. I checked ompi_info to make sure OpenMPI was compiled with  Slurm 
support:

$ ompi_info | grep slurm
Configure command line: '--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3' 
'--disable-silent-rules' '--enable-shared' '--with-pmix=internal' 
'--with-slurm' '--with-psm'
   MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.3)
   MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
   MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.3)

Any ideas what could be wrong? Do you need any additional information?

Prentice



Re: [OMPI users] [External] Re: Can't start jobs with srun.

2020-04-24 Thread Prentice Bisbal via users

Okay. I've got Slurm built with pmix support:

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmix_v3
srun: pmi2
srun: openmpi
srun: pmix

But now when I try to launch a job with srun, the job appears to be 
running, but doesn't do anything - it just hangs in the running state 
but doesn't do anything. Any ideas what could be wrong, or how to debug 
this?


I'm also asking around on the Slurm mailing list, too

Prentice

On 4/23/20 3:03 PM, Ralph Castain wrote:

You can trust the --mpi=list. The problem is likely that OMPI wasn't configured 
--with-pmi2



On Apr 23, 2020, at 11:59 AM, Prentice Bisbal via users 
 wrote:

--mpi=list shows pmi2 and openmpi as valid values, but if I set --mpi= to 
either of them, my job still fails. Why is that? Can I not trust the output of 
--mpi=list?

Prentice

On 4/23/20 10:43 AM, Ralph Castain via users wrote:

No, but you do have to explicitly build OMPI with non-PMIx support if that is what 
you are going to use. In this case, you need to configure OMPI 
--with-pmi2=

You can leave off the path if Slurm (i.e., just "--with-pmi2") was installed in 
a standard location as we should find it there.



On Apr 23, 2020, at 7:39 AM, Prentice Bisbal via users 
 wrote:

It looks like it was built with PMI2, but not PMIx:

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmi2
srun: openmpi

I did launch the job with srun --mpi=pmi2 

Does OpenMPI 4 need PMIx specifically?


On 4/23/20 10:23 AM, Ralph Castain via users wrote:

Is Slurm built with PMIx support? Did you tell srun to use it?



On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users 
 wrote:

I'm using OpenMPI 4.0.3 with Slurm 19.05.5  I'm testing the software with a 
very simple hello, world MPI program that I've used reliably for years. When I 
submit the job through slurm and use srun to launch the job, I get these errors:

*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***and potentially your MPI job)
[dawson029.pppl.gov:26070] Local abort before MPI_INIT completed completed 
successfully, but am not able to aggregate error messages, and not able to 
guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***and potentially your MPI job)
[dawson029.pppl.gov:26076] Local abort before MPI_INIT completed completed 
successfully, but am not able to aggregate error messages, and not able to 
guarantee that all other processes were killed!

If I run the same job, but use mpiexec or mpirun instead of srun, the jobs run 
just fine. I checked ompi_info to make sure OpenMPI was compiled with  Slurm 
support:

$ ompi_info | grep slurm
   Configure command line: '--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3' 
'--disable-silent-rules' '--enable-shared' '--with-pmix=internal' 
'--with-slurm' '--with-psm'
  MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.3)
  MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
  MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
   MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.3)

Any ideas what could be wrong? Do you need any additional information?

Prentice





Re: [OMPI users] [External] Re: Can't start jobs with srun.

2020-04-23 Thread Prentice Bisbal via users
--mpi=list shows pmi2 and openmpi as valid values, but if I set --mpi= 
to either of them, my job still fails. Why is that? Can I not trust the 
output of --mpi=list?


Prentice

On 4/23/20 10:43 AM, Ralph Castain via users wrote:

No, but you do have to explicitly build OMPI with non-PMIx support if that is what 
you are going to use. In this case, you need to configure OMPI 
--with-pmi2=

You can leave off the path if Slurm (i.e., just "--with-pmi2") was installed in 
a standard location as we should find it there.



On Apr 23, 2020, at 7:39 AM, Prentice Bisbal via users 
 wrote:

It looks like it was built with PMI2, but not PMIx:

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmi2
srun: openmpi

I did launch the job with srun --mpi=pmi2 

Does OpenMPI 4 need PMIx specifically?


On 4/23/20 10:23 AM, Ralph Castain via users wrote:

Is Slurm built with PMIx support? Did you tell srun to use it?



On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users 
 wrote:

I'm using OpenMPI 4.0.3 with Slurm 19.05.5  I'm testing the software with a 
very simple hello, world MPI program that I've used reliably for years. When I 
submit the job through slurm and use srun to launch the job, I get these errors:

*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***and potentially your MPI job)
[dawson029.pppl.gov:26070] Local abort before MPI_INIT completed completed 
successfully, but am not able to aggregate error messages, and not able to 
guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***and potentially your MPI job)
[dawson029.pppl.gov:26076] Local abort before MPI_INIT completed completed 
successfully, but am not able to aggregate error messages, and not able to 
guarantee that all other processes were killed!

If I run the same job, but use mpiexec or mpirun instead of srun, the jobs run 
just fine. I checked ompi_info to make sure OpenMPI was compiled with  Slurm 
support:

$ ompi_info | grep slurm
   Configure command line: '--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3' 
'--disable-silent-rules' '--enable-shared' '--with-pmix=internal' 
'--with-slurm' '--with-psm'
  MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.3)
  MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
  MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
   MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.3)

Any ideas what could be wrong? Do you need any additional information?

Prentice





Re: [OMPI users] [External] Re: Can't start jobs with srun.

2020-04-23 Thread Prentice Bisbal via users

It looks like it was built with PMI2, but not PMIx:

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmi2
srun: openmpi

I did launch the job with srun --mpi=pmi2 

Does OpenMPI 4 need PMIx specifically?


On 4/23/20 10:23 AM, Ralph Castain via users wrote:

Is Slurm built with PMIx support? Did you tell srun to use it?



On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users 
 wrote:

I'm using OpenMPI 4.0.3 with Slurm 19.05.5  I'm testing the software with a 
very simple hello, world MPI program that I've used reliably for years. When I 
submit the job through slurm and use srun to launch the job, I get these errors:

*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***and potentially your MPI job)
[dawson029.pppl.gov:26070] Local abort before MPI_INIT completed completed 
successfully, but am not able to aggregate error messages, and not able to 
guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***and potentially your MPI job)
[dawson029.pppl.gov:26076] Local abort before MPI_INIT completed completed 
successfully, but am not able to aggregate error messages, and not able to 
guarantee that all other processes were killed!

If I run the same job, but use mpiexec or mpirun instead of srun, the jobs run 
just fine. I checked ompi_info to make sure OpenMPI was compiled with  Slurm 
support:

$ ompi_info | grep slurm
   Configure command line: '--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3' 
'--disable-silent-rules' '--enable-shared' '--with-pmix=internal' 
'--with-slurm' '--with-psm'
  MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.3)
  MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
  MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
   MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.3)

Any ideas what could be wrong? Do you need any additional information?

Prentice





[OMPI users] Can't start jobs with srun.

2020-04-23 Thread Prentice Bisbal via users
I'm using OpenMPI 4.0.3 with Slurm 19.05.5  I'm testing the software 
with a very simple hello, world MPI program that I've used reliably for 
years. When I submit the job through slurm and use srun to launch the 
job, I get these errors:


*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[dawson029.pppl.gov:26070] Local abort before MPI_INIT completed 
completed successfully, but am not able to aggregate error messages, and 
not able to guarantee that all other processes were killed!

*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[dawson029.pppl.gov:26076] Local abort before MPI_INIT completed 
completed successfully, but am not able to aggregate error messages, and 
not able to guarantee that all other processes were killed!


If I run the same job, but use mpiexec or mpirun instead of srun, the 
jobs run just fine. I checked ompi_info to make sure OpenMPI was 
compiled with  Slurm support:


$ ompi_info | grep slurm
  Configure command line: 
'--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3' 
'--disable-silent-rules' '--enable-shared' '--with-pmix=internal' 
'--with-slurm' '--with-psm'

 MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.3)
 MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
 MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
  MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.3)

Any ideas what could be wrong? Do you need any additional information?

Prentice



Re: [OMPI users] [External] Re: AMD EPYC 7281: does NOT, support binding memory to the process location

2020-01-10 Thread Prentice Bisbal via users

Raymond,

Thanks for the info. Since we are still at CentOS 6, that is most likely 
the problem.


Prentice

On 1/8/20 8:52 PM, Raymond Muno via users wrote:
AMD lists the minimum supported kernel for EPYC/NAPLES as RHEL/CentOS 
kernel 3.10-862, which is RHEL/CentOS 7.5 or later. Upgraded kernels 
can be used in 7.4.


http://developer.amd.com/wp-content/resources/56420.pdf

-Ray Muno


Re: [OMPI users] [External] Re: AMD EPYC 7281: does NOT, support binding memory to the process location

2020-01-08 Thread Prentice Bisbal via users



On 1/8/20 3:30 PM, Brice Goglin via users wrote:

Le 08/01/2020 à 21:20, Prentice Bisbal via users a écrit :

We just added about a dozen nodes to our cluster, which have AMD EPYC
7281 processors. When a particular user's jobs fall on one of these
nodes, he gets these error messages:

--

WARNING: a request was made to bind a process. While the system
supports binding the process itself, at least one node does NOT
support binding memory to the process location.

   Node:  dawson205


I wonder if the CentOS 6 kernel properly supports these recent
processors. Does lstopo show NUMA nodes as expected?

Brice

lstopo shows the different NUMA nodes, and it appears to be correct, but I 
don't use lstopo that much, so I'm not 100% confident that what it's 
showing is correct. I'm at about 98%.
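Two quick cross-checks (generic commands, nothing EPYC-specific) that should agree with each other if the kernel is exposing the NUMA layout properly:

$ lstopo-no-graphics --no-io     # hwloc's view of packages, NUMA nodes and cores
$ numactl --hardware             # the kernel's view: node count, CPU lists, per-node memory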


Prentice



[OMPI users] AMD EPYC 7281: does NOT, support binding memory to the process location

2020-01-08 Thread Prentice Bisbal via users
We just added about a dozen nodes to our cluster, which have AMD EPYC 
7281 processors. When a particular user's jobs fall on one of these 
nodes, he gets these error messages:


--
WARNING: a request was made to bind a process. While the system
supports binding the process itself, at least one node does NOT
support binding memory to the process location.

  Node:  dawson205

This usually is due to not having the required NUMA support installed
on the node. In some Linux distributions, the required support is
contained in the libnumactl and libnumactl-devel packages.
This is a warning only; your job will continue, though performance may 
be degraded.

--
--
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to: NONE
   Node:    dawson205
   #processes:  2
   #cpus:   1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--

The OS is CentOS 6, and numactl and numactl-devel are installed. Any 
idea what the issue is and how to fix it? Is SMT enabled when it 
shouldn't be, or something along those lines?


--
Prentice



[OMPI users] hwloc support for Power9/IBM AC922 servers

2019-04-16 Thread Prentice Bisbal via users

OpenMPI Users,

Are any of you using hwloc on Power9 hardware, specifically the IBM 
AC922 servers? If so, have you encountered any issues? I checked the 
documentation for the latest version (2.03), and found this:


Since it uses standard Operating System information, hwloc's support 
is mostly independent of the processor type (x86, powerpc, ...) and 
just relies on the Operating System support.


and this:

To check whether hwloc works on a particular machine, just try to 
build it and run lstopo or lstopo-no-graphics.

If some things do not look right (e.g. bogus or missing cache information


We haven't bought any AC922 nodes yet, so I can't try that just yet. We 
are looking to purchase a small cluster, and want to make sure there are 
no known issues between the hardware and software before we make a 
purchase.


Any feedback will be greatly appreciated.

Thanks,

Prentice

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] OpenMPI 2.1.0 + PGI 17.3 = asm test failures

2018-11-28 Thread Prentice Bisbal via users

Sylvain,

I just ran into the same exact errors when compiling OpenMPI 3.0.0 with 
PGI 18.3


Prentice

On 5/1/17 2:58 PM, Sylvain Jeaugey wrote:

I also saw IBM and ignored the email :-)

Thanks for reporting the issue, I passed it to the PGI team.

On 05/01/2017 11:49 AM, Prentice Bisbal wrote:

Jeff,

You probably were thrown off when I said I've only really seen this 
problem when people didn't cross-compile correctly on the Blue Gene/P 
I used to support. Also, PGI, and IBM both have 3 letters...;)


Prentice

On 05/01/2017 02:20 PM, Jeff Squyres (jsquyres) wrote:

Er... right.  Duh.



On May 1, 2017, at 11:21 AM, Prentice Bisbal  wrote:

Jeff,

Why IBM? This problem is caused by the PGI compilers, so shouldn't 
this be directed towards NVidia, which now owns PGI?


Prentice

On 04/29/2017 07:37 AM, Jeff Squyres (jsquyres) wrote:

IBM: can someone check to see if this is a compiler error?


On Apr 28, 2017, at 5:09 PM, Prentice Bisbal  
wrote:


Update: removing the -fast switch caused this error to go away.

Prentice

On 04/27/2017 06:00 PM, Prentice Bisbal wrote:
I'm building Open MPI 2.1.0 with PGI 17.3, and now I'm getting 
'illegal instruction' errors during 'make check':


../../config/test-driver: line 107: 65169 Illegal instruction 
"$@" > $log_file 2>&1

FAIL: atomic_math
    - 1 threads: Passed

That's just one example of the error output. See all relevant 
error output below.


Usually, I see these errors when trying to run an executable on 
a processor that doesn't support the instruction set of the 
executable. I used to see this all the time when I supported an 
IBM Blue Gene/P system. I don't think I've ever seen it on an 
x86 system.


I'm passing the argument '-tp=x64' to pgcc to build a unified 
binary, so that might be part of the problem, but I've used this 
exact same process to build 2.1.0 with PGI 16.5 just a couple 
hours ago. I also built 1.10.3 with the same compiler flags with 
PGI 16.5 and 17.3 without this error.


Any ideas?

The relevant output from 'make check':

make  check-TESTS
make[3]: Entering directory `/local/pbisbal/openmpi-2.1.0/test/asm'
make[4]: Entering directory `/local/pbisbal/openmpi-2.1.0/test/asm'
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
PASS: atomic_barrier
    - 1 threads: Passed
PASS: atomic_barrier
    - 2 threads: Passed
PASS: atomic_barrier
    - 4 threads: Passed
PASS: atomic_barrier
    - 5 threads: Passed
PASS: atomic_barrier
    - 8 threads: Passed
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
PASS: atomic_barrier_noinline
    - 1 threads: Passed
PASS: atomic_barrier_noinline
    - 2 threads: Passed
PASS: atomic_barrier_noinline
    - 4 threads: Passed
PASS: atomic_barrier_noinline
    - 5 threads: Passed
PASS: atomic_barrier_noinline
    - 8 threads: Passed
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
PASS: atomic_spinlock
    - 1 threads: Passed
PASS: atomic_spinlock
    - 2 threads: Passed
PASS: atomic_spinlock
    - 4 threads: Passed
PASS: atomic_spinlock
    - 5 threads: Passed
PASS: atomic_spinlock
    - 8 threads: Passed
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
PASS: atomic_spinlock_noinline
    - 1 threads: Passed
PASS: atomic_spinlock_noinline
    - 2 threads: Passed
PASS: atomic_spinlock_noinline
    - 4 threads: Passed
PASS: atomic_spinlock_noinline
    - 5 threads: Passed
PASS: atomic_spinlock_noinline
    - 8 threads: Passed
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
../../config/test-driver: line 107: 65169 Illegal instruction 
"$@" > $log_file 2>&1

FAIL: atomic_math
    - 1 threads: Passed
../../config/test-driver: line 107: 65172 Illegal instruction 
"$@" > $log_file 2>&1

FAIL: atomic_math
    - 2 threads: Passed
../../config/test-driver: line 107: 65176 Illegal instruction 
"$@" > $log_file 2>&1

FAIL: atomic_math
    - 4 threads: Passed
../../config/test-driver: line 107: 65180 Illegal instruction 
"$@" > $log_file 2>&1

FAIL: atomic_math
    - 5 threads: Passed
../../config/test-driver: line 107: 65185 Illegal instruction 
"$@" > $log_file 2>&1

FAIL: atomic_math
    - 8 threads: Passed
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
../../config/test-driver: line 107: 65195 Illegal instruction 
"$@" > $log_file 2>&1

FAIL: atomic_math_noinline
    - 1 threads: Passed
../../config/test-driver: line 107: 65198 Illegal instruction 
"$@" > $log_file 2>&1

FAIL: atomic_math_noinline
    - 2 threads: Passed
../../config/test-driver: line 107: 65202 Illegal instruction 
"$@" > $log_file 2>&1

FAIL: atomic_math_noinline
    - 4 thre

Re: [OMPI users] OpenMPI & Slurm: mpiexec/mpirun vs. srun

2017-12-19 Thread Prentice Bisbal

Ralph,

Thank you very much for your response. I'll pass this along to my 
users. Sounds like we might need to do some testing of our own. We're 
still using Slurm 15.08, but planning to upgrade to 17.11 soon, so it 
sounds like we'll get some performance benefits from doing so.
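For what it's worth, the two launch styles being compared look roughly like this in a batch script (a sketch; the exact --mpi plugin name depends on how Slurm was built):

#!/bin/bash
#SBATCH -N 2
#SBATCH -n 64

# either: launch via Open MPI's own launcher
mpiexec ./app

# or: launch via Slurm itself, using its PMIx (or pmi2) plugin
# srun --mpi=pmix ./app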


Prentice

On 12/18/2017 08:12 PM, r...@open-mpi.org wrote:

We have had reports of applications running faster when executing under OMPI’s 
mpiexec versus when started by srun. Reasons aren’t entirely clear, but are 
likely related to differences in mapping/binding options (OMPI provides a very 
large range compared to srun) and optimization flags provided by mpiexec that 
are specific to OMPI.

OMPI uses PMIx for wireup support (starting with the v2.x series), which 
provides a faster startup than other PMI implementations. However, that is also 
available with Slurm starting with the 16.05 release, and some further 
PMIx-based launch optimizations were recently added to the Slurm 17.11 release. 
So I would expect that launch via srun with the latest Slurm release and PMIx 
would be faster than mpiexec - though that still leaves the faster execution 
reports to consider.

HTH
Ralph



On Dec 18, 2017, at 2:18 PM, Prentice Bisbal <pbis...@pppl.gov> wrote:

Greeting OpenMPI users and devs!

We use OpenMPI with Slurm as our scheduler, and a user has asked me this: 
should they use mpiexec/mpirun or srun to start their MPI jobs through Slurm?

My inclination is to use mpiexec, since that is the only method that's 
(somewhat) defined in the MPI standard and therefore the most portable, and the 
examples in the OpenMPI FAQ use mpirun. However, the Slurm documentation on the 
schedmd website say to use srun with the --mpi=pmi option. (See links below)

What are the pros/cons of using these two methods, other than the portability 
issue I already mentioned? Does srun+pmi use a different method to wire up the 
connections? Some things I read online seem to indicate that. If slurm was 
built with PMI support, and OpenMPI was built with Slurm support, does it 
really make any difference?

https://www.open-mpi.org/faq/?category=slurm
https://slurm.schedmd.com/mpi_guide.html#open_mpi


--
Prentice

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] OpenMPI & Slurm: mpiexec/mpirun vs. srun

2017-12-18 Thread Prentice Bisbal

Greeting OpenMPI users and devs!

We use OpenMPI with Slurm as our scheduler, and a user has asked me 
this: should they use mpiexec/mpirun or srun to start their MPI jobs 
through Slurm?


My inclination is to use mpiexec, since that is the only method that's 
(somewhat) defined in the MPI standard and therefore the most portable, 
and the examples in the OpenMPI FAQ use mpirun. However, the Slurm 
documentation on the schedmd website says to use srun with the --mpi=pmi 
option. (See links below)


What are the pros/cons of using these two methods, other than the 
portability issue I already mentioned? Does srun+pmi use a different 
method to wire up the connections? Some things I read online seem to 
indicate that. If slurm was built with PMI support, and OpenMPI was 
built with Slurm support, does it really make any difference?


https://www.open-mpi.org/faq/?category=slurm
https://slurm.schedmd.com/mpi_guide.html#open_mpi


--
Prentice

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] bind-to-core with AMD CMT?

2017-08-29 Thread Prentice Bisbal

I'd like to follow up to my own e-mail...

After playing around with the --bind-to options, it seems there is no 
way to do this with AMD CMT processors, since they are actual physical 
cores, and not hardware threads that appear as "logical cores" as with 
Intel processors with hyperthreading, which, in hindsight, makes perfect 
sense.


In the BIOS, you can reduce the number of cores to match the number 
of FPUs. On the SuperMicro systems I was testing on, the option is 
called "Downcore" (or something like that) and I set it to a value of 
"compute unit".


Prentice

On 08/24/2017 03:11 PM, Prentice Bisbal wrote:

OpenMPI Users,

I am using AMD processors with CMT, where two cores constitute a 
module, and there is only one FPU per module, so each pair of cores 
has to share a single FPU.  I want to use only one core per module so 
there is no contention between cores in the same module for the single 
FPU. Is this possible from the command-line using mpirun with the 
correct binding specifications? If so, how would I do this?


I am using OpenMPI 1.10.3. I read the man page regarding the 
bind-to-core options, and I'm not sure that will do exactly what I 
want, so I figured I'd ask the experts here.




___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


[OMPI users] bind-to-core with AMD CMT?

2017-08-24 Thread Prentice Bisbal

OpenMPI Users,

I am using AMD processors with CMT, where two cores constitute a 
module, and there is only one FPU per module, so each pair of cores has 
to share a single FPU.  I want to use only one core per module so there 
is no contention between cores in the same module for the single FPU. Is 
this possible from the command-line using mpirun with the correct 
binding specifications? If so, how would I do this?


I am using OpenMPI 1.10.3. I read the man page regarding the 
bind-to-core options, and I'm not sure that will do exactly what I want, 
so I figured I'd ask the experts here.


--
Prentice Bisbal
Lead Software Engineer
Princeton Plasma Physics Laboratory
http://www.pppl.gov

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] Strange OpenMPI errors showing up in Caffe rc5 build

2017-05-08 Thread Prentice Bisbal


On 05/06/2017 03:28 AM, Lane, William wrote:


The strange thing is OpenMPI isn't mentioned anywhere as being a 
dependency for Caffe! I haven't read anything that suggests OpenMPI is 
supported  in Caffe either. This is why I figure it must be a 
dependency of Caffe (of which there are 15) that relies on OpenMPI.




Are you sure you didn't download Caffe MPI by accident? Both versions are 
open-source and available from Github:


http://www.inspursystems.com/dl/open-source-caffe-mpi-download/



I tried setting the compiler to mpic++ in the Makefile.config file and 
the result was:


Makefile:314: *** Cannot static link with the mpic++ compiler.  Stop.


I'm going to try explicitly enumerating all OpenMPI libraries in 
Makefile.config and see if that makes a difference.




I would not recommend that. It's always better to use the wrapper 
scripts (mpicc, mpic++, mpif90, etc.). If that's not working, it would 
be better for you to find out why and fix that problem. I would start 
with a simple MPI-enabled "Hello, world!" C++ program. See if you can 
compile that, and go from there. Start simple and work your way up from 
there.
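i.e., something as small as this (the file name is just an example):

mpic++ -o hello_mpi hello_mpi.cpp     # the wrapper supplies the MPI include and link flags
mpirun -np 2 ./hello_mpi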


I've seen errors similar to yours in the past that were caused by the 
wrong switches being passed to the compiler. I've also seen similar 
errors when the compiler command was screwed up, too  (typo, etc.) Also, 
check to make sure that the static OpenMPI libraries exist. I think 
they're built by default, but I could be wrong. I always explicitly 
specify building both static and dynamic libraries for all my software 
at configure time.


Also, when you post error messages, ALWAYS include the command that 
caused the error. An error message by itself, like the one above, 
doesn't give us much information to help you diagnose the problem. If 
you provided the command, it's possible that someone on the list could 
immediately see a problem with the command and quickly pinpoint the 
problem.


Prentice



Thanks for your help, the Caffe listserve group doesn't have any 
answers for this issue (except use the Docker image).



-William L.


*From:* users <users-boun...@lists.open-mpi.org> on behalf of Prentice 
Bisbal <pbis...@pppl.gov>

*Sent:* Friday, May 5, 2017 7:47:39 AM
*To:* users@lists.open-mpi.org
*Subject:* Re: [OMPI users] Strange OpenMPI errors showing up in Caffe 
rc5 build


On 05/04/2017 09:08 PM, gil...@rist.or.jp wrote:


William,

the link error clearly shows libcaffe.so does require C++ bindings.

did you build caffe from a fresh tree ?

what if you

ldd libcaffe.so

nm libcaffe.so | grep -i ompi

if libcaffe.so does require mpi c++ bindings, it should depend on it

(otherwise the way it was built is questionable)

you might want to link with mpic++ instead of g++



This is a great point I missed in my previous e-mail on this topic. 
When compiling a program that uses MPI, you want to specify the MPI 
compiler wrappers for your C, C++ and Fortran compilers, and not your 
chosen compiler directly. For example:


./configure --prefix=/usr/local/foo-1.2.3 CC=mpicc CXX=mpicxx FC=mpif90

 or something similar. This guarantees that the actual compiler is 
called with all the right flags for the C preprocessor, linker, etc. 
This will almost always prevent those linking errors.


note the MPI C++ bindings are no longer built by default since v2.0, so 
you likely have to


configure --enable-mpi-cxx

last but not least, make sure caffe and openmpi were built with the 
same c++ compiler


Cheers,

Gilles

- Original Message -

I know this could possibly be off-topic, but the errors are
OpenMPI errors and if anyone could shed light on the nature of
these errors I figure it would be this group:

CXX/LD -o .build_release/tools/upgrade_solver_proto_text.bin
g++ .build_release/tools/upgrade_solver_proto_text.o -o
.build_release/tools/upgrade_solver_proto_text.bin -pthread
-fPIC -DCAFFE_VERSION=1.0.0-rc5 -DNDEBUG -O2 -DUSE_OPENCV
-DUSE_LEVELDB -DUSE_LMDB -DCPU_ONLY -DWITH_PYTHON_LAYER
-I/hpc/apps/python27/include/python2.7

-I/hpc/apps/python27/externals/numpy/1.9.2/lib/python2.7/site-packages/numpy/core/include
-I/usr/local/include -I/hpc/apps/hdf5/1.8.17/include
-I.build_release/src -I./src -I./include
-I/hpc/apps/atlas/3.10.2/include -Wall -Wno-sign-compare
-lcaffe -L/hpc/apps/gflags/lib -L/hpc/apps/python27/lib
-L/hpc/apps/python27/lib/python2.7
-L/hpc/apps/atlas/3.10.2/lib -L.build_release/lib-lglog
-lgflags -lprotobuf -lboost_system -lboost_filesystem -lm
-lhdf5_hl -lhdf5 -lleveldb -lsnappy -llmdb -lopencv_core
-lopencv_highgui -lopencv_imgproc -lboost_thread -lstdc++
-lboost_python -lpython2.7 -lcblas -latlas \
-Wl,-rpath,\$ORIGIN/../lib
.build_release/lib/libcaffe.s

Re: [OMPI users] Strange OpenMPI errors showing up in Caffe rc5 build

2017-05-05 Thread Prentice Bisbal


On 05/04/2017 09:08 PM, gil...@rist.or.jp wrote:


William,

the link error clearly shows libcaffe.so does require C++ bindings.

did you build caffe from a fresh tree ?

what if you

ldd libcaffe.so

nm libcaffe.so | grep -i ompi

if libcaffe.so does require mpi c++ bindings, it should depend on it

(otherwise the way it was built is questionable)

you might want to link with mpic++ instead of g++



This is a great point I missed in my previous e-mail on this topic. When 
compiling a program that uses MPI, you want to specify the MPI compiler 
wrappers for your C, C++ and Fortran compilers, and not your chosen 
compiler directly. For example:


./configure --prefix=/usr/local/foo-1.2.3 CC=mpicc CXX=mpicxx FC=mpif90

 or something similar. This guarantees that the actual compiler is 
called with all the right flags for the C preprocessor, linker, etc. 
This will almost always prevent those linking errors.


note the MPI C++ bindings are no longer built by default since v2.0, so 
you likely have to


configure --enable-mpi-cxx

last but not least, make sure caffe and openmpi were built with the 
same c++ compiler


Cheers,

Gilles

- Original Message -

I know this could possibly be off-topic, but the errors are
OpenMPI errors and if anyone could shed light on the nature of
these errors I figure it would be this group:

CXX/LD -o .build_release/tools/upgrade_solver_proto_text.bin
g++ .build_release/tools/upgrade_solver_proto_text.o -o
.build_release/tools/upgrade_solver_proto_text.bin -pthread
-fPIC -DCAFFE_VERSION=1.0.0-rc5 -DNDEBUG -O2 -DUSE_OPENCV
-DUSE_LEVELDB -DUSE_LMDB -DCPU_ONLY -DWITH_PYTHON_LAYER
-I/hpc/apps/python27/include/python2.7

-I/hpc/apps/python27/externals/numpy/1.9.2/lib/python2.7/site-packages/numpy/core/include
-I/usr/local/include -I/hpc/apps/hdf5/1.8.17/include
-I.build_release/src -I./src -I./include
-I/hpc/apps/atlas/3.10.2/include -Wall -Wno-sign-compare
-lcaffe -L/hpc/apps/gflags/lib -L/hpc/apps/python27/lib
-L/hpc/apps/python27/lib/python2.7
-L/hpc/apps/atlas/3.10.2/lib -L.build_release/lib-lglog
-lgflags -lprotobuf -lboost_system -lboost_filesystem -lm
-lhdf5_hl -lhdf5 -lleveldb -lsnappy -llmdb -lopencv_core
-lopencv_highgui -lopencv_imgproc -lboost_thread -lstdc++
-lboost_python -lpython2.7 -lcblas -latlas \
-Wl,-rpath,\$ORIGIN/../lib
.build_release/lib/libcaffe.so: undefined reference to
`ompi_mpi_cxx_op_intercept'
.build_release/lib/libcaffe.so: undefined reference to
`MPI::Datatype::Free()'
.build_release/lib/libcaffe.so: undefined reference to
`MPI::Comm::Comm()'
.build_release/lib/libcaffe.so: undefined reference to
`MPI::Win::Free()'
collect2: error: ld returned 1 exit status

I've read this may be due to a dependency of Caffe that uses
OpenMPI (since I've been told Caffe itself doesn't use OpenMPI).

Would adding -l directives to LIBRARIES line in the Makefile for
Caffe that reference all OpenMPI libraries fix this problem?

For example, -l mpi.

Thank you in advance. Hopefully this isn't entirely OT.

William L.

IMPORTANT WARNING: This message is intended for the use of the
person or entity to which it is addressed and may contain
information that is privileged and confidential, the disclosure of
which is governed by applicable law. If the reader of this message
is not the intended recipient, or the employee or agent
responsible for delivering it to the intended recipient, you are
hereby notified that any dissemination, distribution or copying of
this information is strictly prohibited. Thank you for your
cooperation. IMPORTANT WARNING: This message is intended for the
use of the person or entity to which it is addressed and may
contain information that is privileged and confidential, the
disclosure of which is governed by applicable law. If the reader
of this message is not the intended recipient, or the employee or
agent responsible for delivering it to the intended recipient, you
are hereby notified that any dissemination, distribution or
copying of this information is strictly prohibited. Thank you for
your cooperation. 




___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Strange OpenMPI errors on building Caffe 1.0

2017-05-05 Thread Prentice Bisbal
This error should really be posted to the Caffe mailing list. This is 
an error with Caffe. Most likely, you are not specifying the location 
of your Open MPI installation properly. And Caffe definitely depends on 
Open MPI, judging by your errors:



.build_release/lib/libcaffe.so: undefined reference to 
`ompi_mpi_cxx_op_intercept'
.build_release/lib/libcaffe.so: undefined reference to 
`MPI::Datatype::Free()'

.build_release/lib/libcaffe.so: undefined reference to `MPI::Comm::Comm()'
.build_release/lib/libcaffe.so: undefined reference to `MPI::Win::Free()'


Are basically saying that the libcaffe shared library (libcaffe.so) was 
compiled making references to MPI functions, but now can't find the 
libraries that actually provide those functions. This means that when 
libcaffe.so was compiled, it could find the OpenMPI headers containing 
the function prototypes, but now it can't find the actual libraries.



To fix this, your command needs a -L argument specifying the path to 
where the OpenMPI libraries are located, followed by -l (lower case L) 
arguments for each of the MPI libraries you need. -lmpi is probably one 
of them, but most MPI implementations require additional libraries, such 
as -lopen-rte, -lopen-pal, etc., in the case of Open MPI.
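
One way to see exactly which link flags your Open MPI installation 
expects is to ask the wrapper compilers themselves, e.g.:

mpicc --showme:link
mpicxx --showme:link

You could paste that output into the LIBRARIES line of the Makefile, 
although using the wrappers directly is still the cleaner fix.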



While I think I've answered your questions, it's best to ask this on a 
Caffe mailing list. If the build process could find your MPI headers but 
not your MPI libraries, either you configured your build incorrectly or 
something in the Caffe configure/build process is broken, so you either 
need to find out how to configure your build correctly, or report a bug 
in the Caffe build process.



Prentice

On 05/04/2017 05:35 PM, Lane, William wrote:


I know this could possibly be off-topic, but the errors are OpenMPI 
errors and if anyone could shed light on the nature of these errors I 
figure it would be this group:

CXX/LD -o .build_release/tools/upgrade_solver_proto_text.bin
g++ .build_release/tools/upgrade_solver_proto_text.o -o 
.build_release/tools/upgrade_solver_proto_text.bin -pthread -fPIC 
-DCAFFE_VERSION=1.0.0-rc5 -DNDEBUG -O2 -DUSE_OPENCV -DUSE_LEVELDB 
-DUSE_LMDB -DCPU_ONLY -DWITH_PYTHON_LAYER 
-I/hpc/apps/python27/include/python2.7 
-I/hpc/apps/python27/externals/numpy/1.9.2/lib/python2.7/site-packages/numpy/core/include 
-I/usr/local/include -I/hpc/apps/hdf5/1.8.17/include 
-I.build_release/src -I./src -I./include 
-I/hpc/apps/atlas/3.10.2/include -Wall -Wno-sign-compare -lcaffe 
-L/hpc/apps/gflags/lib -L/hpc/apps/python27/lib 
-L/hpc/apps/python27/lib/python2.7 -L/hpc/apps/atlas/3.10.2/lib 
-L.build_release/lib  -lglog -lgflags -lprotobuf -lboost_system 
-lboost_filesystem -lm -lhdf5_hl -lhdf5 -lleveldb -lsnappy -llmdb 
-lopencv_core -lopencv_highgui -lopencv_imgproc -lboost_thread 
-lstdc++ -lboost_python -lpython2.7 -lcblas -latlas \

-Wl,-rpath,\$ORIGIN/../lib
.build_release/lib/libcaffe.so: undefined reference to 
`ompi_mpi_cxx_op_intercept'
.build_release/lib/libcaffe.so: undefined reference to 
`MPI::Datatype::Free()'

.build_release/lib/libcaffe.so: undefined reference to `MPI::Comm::Comm()'
.build_release/lib/libcaffe.so: undefined reference to `MPI::Win::Free()'
collect2: error: ld returned 1 exit status
I've read this may be due to a dependency of Caffe that uses OpenMPI 
(since I've been told Caffe itself doesn't use OpenMPI).


Would adding -l directives to LIBRARIES line in the Makefile for Caffe 
that reference all OpenMPI libraries fix this problem?

For example, -l mpi.

Thank you in advance. Hopefully this isn't entirely OT.

William L.

IMPORTANT WARNING: This message is intended for the use of the person 
or entity to which it is addressed and may contain information that is 
privileged and confidential, the disclosure of which is governed by 
applicable law. If the reader of this message is not the intended 
recipient, or the employee or agent responsible for delivering it to 
the intended recipient, you are hereby notified that any 
dissemination, distribution or copying of this information is strictly 
prohibited. Thank you for your cooperation.



___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] OpenMPI 2.1.0 + PGI 17.3 = asm test failures

2017-05-01 Thread Prentice Bisbal

Jeff,

You probably were thrown off when I said I've only really seen this 
problem when people didn't cross-compile correctly on the Blue Gene/P I 
used to support. Also, PGI and IBM both have 3 letters... ;)


Prentice

On 05/01/2017 02:20 PM, Jeff Squyres (jsquyres) wrote:

Er... right.  Duh.



On May 1, 2017, at 11:21 AM, Prentice Bisbal <pbis...@pppl.gov> wrote:

Jeff,

Why IBM? This problem is caused by the PGI compilers, so shouldn't this be 
directed towards NVidia, which now owns PGI?

Prentice

On 04/29/2017 07:37 AM, Jeff Squyres (jsquyres) wrote:

IBM: can someone check to see if this is a compiler error?



On Apr 28, 2017, at 5:09 PM, Prentice Bisbal <pbis...@pppl.gov> wrote:

Update: removing the -fast switch caused this error to go away.

Prentice

On 04/27/2017 06:00 PM, Prentice Bisbal wrote:

I'm building Open MPI 2.1.0 with PGI 17.3, and now I'm getting 'illegal 
instruction' errors during 'make check':

../../config/test-driver: line 107: 65169 Illegal instruction "$@" > $log_file 
2>&1
FAIL: atomic_math
- 1 threads: Passed

That's just one example of the error output. See all relevant error output 
below.

Usually, I see these errors when trying to run an executable on a processor 
that doesn't support the instruction set of the executable. I used to see this 
all the time when I supported an IBM Blue Gene/P system. I don't think I've 
ever seen it on an x86 system.

I'm passing the argument '-tp=x64' to pgcc to build a unified binary, so that 
might be part of the problem, but I've used this exact same process to build 
2.1.0 with PGI 16.5 just a couple hours ago. I also built 1.10.3 with the same 
compiler flags with PGI 16.5 and 17.3 without this error.

Any ideas?

The relevant output from 'make check':

make  check-TESTS
make[3]: Entering directory `/local/pbisbal/openmpi-2.1.0/test/asm'
make[4]: Entering directory `/local/pbisbal/openmpi-2.1.0/test/asm'
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
PASS: atomic_barrier
- 1 threads: Passed
PASS: atomic_barrier
- 2 threads: Passed
PASS: atomic_barrier
- 4 threads: Passed
PASS: atomic_barrier
- 5 threads: Passed
PASS: atomic_barrier
- 8 threads: Passed
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
PASS: atomic_barrier_noinline
- 1 threads: Passed
PASS: atomic_barrier_noinline
- 2 threads: Passed
PASS: atomic_barrier_noinline
- 4 threads: Passed
PASS: atomic_barrier_noinline
- 5 threads: Passed
PASS: atomic_barrier_noinline
- 8 threads: Passed
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
PASS: atomic_spinlock
- 1 threads: Passed
PASS: atomic_spinlock
- 2 threads: Passed
PASS: atomic_spinlock
- 4 threads: Passed
PASS: atomic_spinlock
- 5 threads: Passed
PASS: atomic_spinlock
- 8 threads: Passed
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
PASS: atomic_spinlock_noinline
- 1 threads: Passed
PASS: atomic_spinlock_noinline
- 2 threads: Passed
PASS: atomic_spinlock_noinline
- 4 threads: Passed
PASS: atomic_spinlock_noinline
- 5 threads: Passed
PASS: atomic_spinlock_noinline
- 8 threads: Passed
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
../../config/test-driver: line 107: 65169 Illegal instruction "$@" > $log_file 
2>&1
FAIL: atomic_math
- 1 threads: Passed
../../config/test-driver: line 107: 65172 Illegal instruction "$@" > $log_file 
2>&1
FAIL: atomic_math
- 2 threads: Passed
../../config/test-driver: line 107: 65176 Illegal instruction "$@" > $log_file 
2>&1
FAIL: atomic_math
- 4 threads: Passed
../../config/test-driver: line 107: 65180 Illegal instruction "$@" > $log_file 
2>&1
FAIL: atomic_math
- 5 threads: Passed
../../config/test-driver: line 107: 65185 Illegal instruction "$@" > $log_file 
2>&1
FAIL: atomic_math
- 8 threads: Passed
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
../../config/test-driver: line 107: 65195 Illegal instruction "$@" > $log_file 
2>&1
FAIL: atomic_math_noinline
- 1 threads: Passed
../../config/test-driver: line 107: 65198 Illegal instruction "$@" > $log_file 
2>&1
FAIL: atomic_math_noinline
- 2 threads: Passed
../../config/test-driver: line 107: 65202 Illegal instruction "$@" > $log_file 
2>&1
FAIL: atomic_math_noinline
- 4 threads: Passed
../../config/test-driver: line 107: 65206 Illegal instruction "$@" > $log_file 
2>&1
FAIL: atomic_math_noinline
- 5 threads: Passed
../../config/test-driver: line 107: 65210 Illegal instruction "$@" > $log_file 
2>&1

Re: [OMPI users] OpenMPI 2.1.0 + PGI 17.3 = asm test failures

2017-05-01 Thread Prentice Bisbal

Jeff,

Why IBM? This problem is caused by the PGI compilers, so shouldn't this 
be directed towards NVidia, which now owns PGI?


Prentice

On 04/29/2017 07:37 AM, Jeff Squyres (jsquyres) wrote:

IBM: can someone check to see if this is a compiler error?



On Apr 28, 2017, at 5:09 PM, Prentice Bisbal <pbis...@pppl.gov> wrote:

Update: removing the -fast switch caused this error to go away.

Prentice

On 04/27/2017 06:00 PM, Prentice Bisbal wrote:

I'm building Open MPI 2.1.0 with PGI 17.3, and now I'm getting 'illegal 
instruction' errors during 'make check':

../../config/test-driver: line 107: 65169 Illegal instruction "$@" > $log_file 
2>&1
FAIL: atomic_math
- 1 threads: Passed

That's just one example of the error output. See all relevant error output 
below.

Usually, I see these errors when trying to run an executable on a processor 
that doesn't support the instruction set of the executable. I used to see this 
all the time when I supported an IBM Blue Gene/P system. I don't think I've 
ever seen it on an x86 system.

I'm passing the argument '-tp=x64' to pgcc to build a unified binary, so that 
might be part of the problem, but I've used this exact same process to build 
2.1.0 with PGI 16.5 just a couple hours ago. I also built 1.10.3 with the same 
compiler flags with PGI 16.5 and 17.3 without this error.

Any ideas?

The relevant output from 'make check':

make  check-TESTS
make[3]: Entering directory `/local/pbisbal/openmpi-2.1.0/test/asm'
make[4]: Entering directory `/local/pbisbal/openmpi-2.1.0/test/asm'
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
PASS: atomic_barrier
- 1 threads: Passed
PASS: atomic_barrier
- 2 threads: Passed
PASS: atomic_barrier
- 4 threads: Passed
PASS: atomic_barrier
- 5 threads: Passed
PASS: atomic_barrier
- 8 threads: Passed
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
PASS: atomic_barrier_noinline
- 1 threads: Passed
PASS: atomic_barrier_noinline
- 2 threads: Passed
PASS: atomic_barrier_noinline
- 4 threads: Passed
PASS: atomic_barrier_noinline
- 5 threads: Passed
PASS: atomic_barrier_noinline
- 8 threads: Passed
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
PASS: atomic_spinlock
- 1 threads: Passed
PASS: atomic_spinlock
- 2 threads: Passed
PASS: atomic_spinlock
- 4 threads: Passed
PASS: atomic_spinlock
- 5 threads: Passed
PASS: atomic_spinlock
- 8 threads: Passed
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
PASS: atomic_spinlock_noinline
- 1 threads: Passed
PASS: atomic_spinlock_noinline
- 2 threads: Passed
PASS: atomic_spinlock_noinline
- 4 threads: Passed
PASS: atomic_spinlock_noinline
- 5 threads: Passed
PASS: atomic_spinlock_noinline
- 8 threads: Passed
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
../../config/test-driver: line 107: 65169 Illegal instruction "$@" > $log_file 
2>&1
FAIL: atomic_math
- 1 threads: Passed
../../config/test-driver: line 107: 65172 Illegal instruction "$@" > $log_file 
2>&1
FAIL: atomic_math
- 2 threads: Passed
../../config/test-driver: line 107: 65176 Illegal instruction "$@" > $log_file 
2>&1
FAIL: atomic_math
- 4 threads: Passed
../../config/test-driver: line 107: 65180 Illegal instruction "$@" > $log_file 
2>&1
FAIL: atomic_math
- 5 threads: Passed
../../config/test-driver: line 107: 65185 Illegal instruction "$@" > $log_file 
2>&1
FAIL: atomic_math
- 8 threads: Passed
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
../../config/test-driver: line 107: 65195 Illegal instruction "$@" > $log_file 
2>&1
FAIL: atomic_math_noinline
- 1 threads: Passed
../../config/test-driver: line 107: 65198 Illegal instruction "$@" > $log_file 
2>&1
FAIL: atomic_math_noinline
- 2 threads: Passed
../../config/test-driver: line 107: 65202 Illegal instruction "$@" > $log_file 
2>&1
FAIL: atomic_math_noinline
- 4 threads: Passed
../../config/test-driver: line 107: 65206 Illegal instruction "$@" > $log_file 
2>&1
FAIL: atomic_math_noinline
- 5 threads: Passed
../../config/test-driver: line 107: 65210 Illegal instruction "$@" > $log_file 
2>&1
FAIL: atomic_math_noinline
- 8 threads: Passed
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
../../config/test-driver: line 107: 65220 Illegal instruction "$@" > $log_file 
2>&1
FAIL: atomic_cmpset
- 1 threads: Passed
../../config/test-driver: line 107: 65223 Illegal instruction "$@" >

Re: [OMPI users] OpenMPI 2.1.0 + PGI 17.3 = asm test failures

2017-04-28 Thread Prentice Bisbal

Update: removing the -fast switch caused this error to go away.
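
In other words, configuring with something like

CFLAGS="-fPIC -tp=x64" CXXFLAGS="-fPIC -tp=x64" FCFLAGS="-fPIC -tp=x64"

i.e. my usual flags minus -fast.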

Prentice

On 04/27/2017 06:00 PM, Prentice Bisbal wrote:
I'm building Open MPI 2.1.0 with PGI 17.3, and now I'm getting 
'illegal instruction' errors during 'make check':


../../config/test-driver: line 107: 65169 Illegal instruction "$@" > 
$log_file 2>&1

FAIL: atomic_math
- 1 threads: Passed

That's just one example of the error output. See all relevant error 
output below.


Usually, I see these errors when trying to run an executable on a 
processor that doesn't support the instruction set of the executable. 
I used to see this all the time when I supported an IBM Blue Gene/P 
system. I don't think I've ever seen it on an x86 system.


 I'm passing the argument '-tp=x64' to pgcc to build a unified binary, 
so that might be part of the problem, but I've used this exact same 
process to build 2.1.0 with PGI 16.5 just a couple hours ago. I also 
built 1.10.3 with the same compiler flags with PGI 16.5 and 17.3 
without this error.


Any ideas?

The relevant output from 'make check':

make  check-TESTS
make[3]: Entering directory `/local/pbisbal/openmpi-2.1.0/test/asm'
make[4]: Entering directory `/local/pbisbal/openmpi-2.1.0/test/asm'
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
PASS: atomic_barrier
- 1 threads: Passed
PASS: atomic_barrier
- 2 threads: Passed
PASS: atomic_barrier
- 4 threads: Passed
PASS: atomic_barrier
- 5 threads: Passed
PASS: atomic_barrier
- 8 threads: Passed
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
PASS: atomic_barrier_noinline
- 1 threads: Passed
PASS: atomic_barrier_noinline
- 2 threads: Passed
PASS: atomic_barrier_noinline
- 4 threads: Passed
PASS: atomic_barrier_noinline
- 5 threads: Passed
PASS: atomic_barrier_noinline
- 8 threads: Passed
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
PASS: atomic_spinlock
- 1 threads: Passed
PASS: atomic_spinlock
- 2 threads: Passed
PASS: atomic_spinlock
- 4 threads: Passed
PASS: atomic_spinlock
- 5 threads: Passed
PASS: atomic_spinlock
- 8 threads: Passed
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
PASS: atomic_spinlock_noinline
- 1 threads: Passed
PASS: atomic_spinlock_noinline
- 2 threads: Passed
PASS: atomic_spinlock_noinline
- 4 threads: Passed
PASS: atomic_spinlock_noinline
- 5 threads: Passed
PASS: atomic_spinlock_noinline
- 8 threads: Passed
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
../../config/test-driver: line 107: 65169 Illegal instruction "$@" > 
$log_file 2>&1

FAIL: atomic_math
- 1 threads: Passed
../../config/test-driver: line 107: 65172 Illegal instruction "$@" > 
$log_file 2>&1

FAIL: atomic_math
- 2 threads: Passed
../../config/test-driver: line 107: 65176 Illegal instruction "$@" > 
$log_file 2>&1

FAIL: atomic_math
- 4 threads: Passed
../../config/test-driver: line 107: 65180 Illegal instruction "$@" > 
$log_file 2>&1

FAIL: atomic_math
- 5 threads: Passed
../../config/test-driver: line 107: 65185 Illegal instruction "$@" > 
$log_file 2>&1

FAIL: atomic_math
- 8 threads: Passed
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
../../config/test-driver: line 107: 65195 Illegal instruction "$@" > 
$log_file 2>&1

FAIL: atomic_math_noinline
- 1 threads: Passed
../../config/test-driver: line 107: 65198 Illegal instruction "$@" > 
$log_file 2>&1

FAIL: atomic_math_noinline
- 2 threads: Passed
../../config/test-driver: line 107: 65202 Illegal instruction "$@" > 
$log_file 2>&1

FAIL: atomic_math_noinline
- 4 threads: Passed
../../config/test-driver: line 107: 65206 Illegal instruction "$@" > 
$log_file 2>&1

FAIL: atomic_math_noinline
- 5 threads: Passed
../../config/test-driver: line 107: 65210 Illegal instruction "$@" > 
$log_file 2>&1

FAIL: atomic_math_noinline
- 8 threads: Passed
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
../../config/test-driver: line 107: 65220 Illegal instruction "$@" > 
$log_file 2>&1

FAIL: atomic_cmpset
- 1 threads: Passed
../../config/test-driver: line 107: 65223 Illegal instruction "$@" > 
$log_file 2>&1

FAIL: atomic_cmpset
- 2 threads: Passed
../../config/test-driver: line 107: 65227 Illegal instruction "$@" > 
$log_file 2>&1

FAIL: atomic_cmpset
- 4 threads: Passed
../../config/test-driver: line 107: 65231 Illegal instruction "$@" > 
$log_file 2

[OMPI users] OpenMPI 2.1.0 + PGI 17.3 = asm test failures

2017-04-27 Thread Prentice Bisbal
I'm building Open MPI 2.1.0 with PGI 17.3, and now I'm getting 'illegal 
instruction' errors during 'make check':


../../config/test-driver: line 107: 65169 Illegal instruction "$@" > 
$log_file 2>&1

FAIL: atomic_math
- 1 threads: Passed

That's just one example of the error output. See all relevant error 
output below.


Usually, I see these errors when trying to run an executable on a 
processor that doesn't support the instruction set of the executable. I 
used to see this all the time when I supported an IBM Blue Gene/P 
system. I don't think I've ever seen it on an x86 system.


 I'm passing the argument '-tp=x64' to pgcc to build a unified binary, 
so that might be part of the problem, but I've used this exact same 
process to build 2.1.0 with PGI 16.5 just a couple hours ago. I also 
built 1.10.3 with the same compiler flags with PGI 16.5 and 17.3 without 
this error.


Any ideas?

The relevant output from 'make check':

make  check-TESTS
make[3]: Entering directory `/local/pbisbal/openmpi-2.1.0/test/asm'
make[4]: Entering directory `/local/pbisbal/openmpi-2.1.0/test/asm'
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
PASS: atomic_barrier
- 1 threads: Passed
PASS: atomic_barrier
- 2 threads: Passed
PASS: atomic_barrier
- 4 threads: Passed
PASS: atomic_barrier
- 5 threads: Passed
PASS: atomic_barrier
- 8 threads: Passed
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
PASS: atomic_barrier_noinline
- 1 threads: Passed
PASS: atomic_barrier_noinline
- 2 threads: Passed
PASS: atomic_barrier_noinline
- 4 threads: Passed
PASS: atomic_barrier_noinline
- 5 threads: Passed
PASS: atomic_barrier_noinline
- 8 threads: Passed
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
PASS: atomic_spinlock
- 1 threads: Passed
PASS: atomic_spinlock
- 2 threads: Passed
PASS: atomic_spinlock
- 4 threads: Passed
PASS: atomic_spinlock
- 5 threads: Passed
PASS: atomic_spinlock
- 8 threads: Passed
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
PASS: atomic_spinlock_noinline
- 1 threads: Passed
PASS: atomic_spinlock_noinline
- 2 threads: Passed
PASS: atomic_spinlock_noinline
- 4 threads: Passed
PASS: atomic_spinlock_noinline
- 5 threads: Passed
PASS: atomic_spinlock_noinline
- 8 threads: Passed
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
../../config/test-driver: line 107: 65169 Illegal instruction "$@" > 
$log_file 2>&1

FAIL: atomic_math
- 1 threads: Passed
../../config/test-driver: line 107: 65172 Illegal instruction "$@" > 
$log_file 2>&1

FAIL: atomic_math
- 2 threads: Passed
../../config/test-driver: line 107: 65176 Illegal instruction "$@" > 
$log_file 2>&1

FAIL: atomic_math
- 4 threads: Passed
../../config/test-driver: line 107: 65180 Illegal instruction "$@" > 
$log_file 2>&1

FAIL: atomic_math
- 5 threads: Passed
../../config/test-driver: line 107: 65185 Illegal instruction "$@" > 
$log_file 2>&1

FAIL: atomic_math
- 8 threads: Passed
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
../../config/test-driver: line 107: 65195 Illegal instruction "$@" > 
$log_file 2>&1

FAIL: atomic_math_noinline
- 1 threads: Passed
../../config/test-driver: line 107: 65198 Illegal instruction "$@" > 
$log_file 2>&1

FAIL: atomic_math_noinline
- 2 threads: Passed
../../config/test-driver: line 107: 65202 Illegal instruction "$@" > 
$log_file 2>&1

FAIL: atomic_math_noinline
- 4 threads: Passed
../../config/test-driver: line 107: 65206 Illegal instruction "$@" > 
$log_file 2>&1

FAIL: atomic_math_noinline
- 5 threads: Passed
../../config/test-driver: line 107: 65210 Illegal instruction "$@" > 
$log_file 2>&1

FAIL: atomic_math_noinline
- 8 threads: Passed
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
../../config/test-driver: line 107: 65220 Illegal instruction "$@" > 
$log_file 2>&1

FAIL: atomic_cmpset
- 1 threads: Passed
../../config/test-driver: line 107: 65223 Illegal instruction "$@" > 
$log_file 2>&1

FAIL: atomic_cmpset
- 2 threads: Passed
../../config/test-driver: line 107: 65227 Illegal instruction "$@" > 
$log_file 2>&1

FAIL: atomic_cmpset
- 4 threads: Passed
../../config/test-driver: line 107: 65231 Illegal instruction "$@" > 
$log_file 2>&1

FAIL: atomic_cmpset
- 5 threads: Passed
../../config/test-driver: line 107: 65235 Illegal instruction "$@" > 
$log_file 2>&1

FAIL: atomic_cmpset
- 8 threads: Passed
basename: extra operand `--test-name'
Try `basename --help' for more information.
--> Testing
../../config/test-driver: line 107: 65245 Illegal instruction "$@" > 
$log_file 2>&1

FAIL: atomic_cmpset_noinline
- 1 threads: Passed
../../config/test-driver: line 

Re: [OMPI users] OpenMPI 2.1.0: FAIL: opal_path_nfs

2017-04-26 Thread Prentice Bisbal
That's what I figured, but I wanted to check first. Any idea of exactly 
what it's trying to check?
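
(For what it's worth, the usage string in the test-suite.log quoted 
below suggests it parses the output of mount(8) and checks whether 
opal_path_nfs() classifies each mount point correctly. It looks like it 
can also be run by hand against a specific directory, e.g.:

cd test/util
./opal_path_nfs /usr/pppl

using any path you know is NFS-mounted.)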


Prentice

On 04/26/2017 05:54 PM, r...@open-mpi.org wrote:

You can probably safely ignore it.


On Apr 26, 2017, at 2:29 PM, Prentice Bisbal <pbis...@pppl.gov> wrote:

I'm trying to build OpenMPI 2.1.0 with GCC 5.4.0 on CentOS 6.8. After working 
around the '-Lyes/lib' errors I reported in my previous post, opal_path_nfs 
fails during 'make check' (see below). Is this failure critical, or is it 
something I can ignore and continue with my install? Googling only returned 
links to discussions of similar problems from 4-5 years ago with earlier 
versions of OpenMPI.

STDOUT and STDERR from 'make check':

make  check-TESTS
make[3]: Entering directory `/local/pbisbal/openmpi-2.1.0/test/util'
make[4]: Entering directory `/local/pbisbal/openmpi-2.1.0/test/util'
PASS: opal_bit_ops
FAIL: opal_path_nfs

Testsuite summary for Open MPI 2.1.0

# TOTAL: 2
# PASS:  1
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0

See test/util/test-suite.log
Please report to http://www.open-mpi.org/community/help/


Contents of test/util/test-suite.log:

cat test/util/test-suite.log
==
   Open MPI 2.1.0: test/util/test-suite.log
==

# TOTAL: 2
# PASS:  1
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0

.. contents:: :depth: 2

FAIL: opal_path_nfs
===

Test usage: ./opal_path_nfs [DIR]
On Linux interprets output from mount(8) to check for nfs and verify 
opal_path_nfs()
Additionally, you may specify multiple DIR on the cmd-line, of which you the 
output
get_mounts: dirs[0]:/ fs:rootfs nfs:No
get_mounts: dirs[1]:/proc fs:proc nfs:No
get_mounts: dirs[2]:/sys fs:sysfs nfs:No
get_mounts: dirs[3]:/dev fs:devtmpfs nfs:No
get_mounts: dirs[4]:/dev/pts fs:devpts nfs:No
get_mounts: dirs[5]:/dev/shm fs:tmpfs nfs:No
get_mounts: already know dir[0]:/
get_mounts: dirs[0]:/ fs:nfs nfs:Yes
get_mounts: dirs[6]:/proc/bus/usb fs:usbfs nfs:No
get_mounts: dirs[7]:/var/lib/stateless/writable fs:tmpfs nfs:No
get_mounts: dirs[8]:/var/cache/man fs:tmpfs nfs:No
get_mounts: dirs[9]:/var/lock fs:tmpfs nfs:No
get_mounts: dirs[10]:/var/log fs:tmpfs nfs:No
get_mounts: dirs[11]:/var/run fs:tmpfs nfs:No
get_mounts: dirs[12]:/var/lib/dbus fs:tmpfs nfs:No
get_mounts: dirs[13]:/var/lib/nfs fs:tmpfs nfs:No
get_mounts: dirs[14]:/tmp fs:tmpfs nfs:No
get_mounts: dirs[15]:/var/cache/foomatic fs:tmpfs nfs:No
get_mounts: dirs[16]:/var/cache/hald fs:tmpfs nfs:No
get_mounts: dirs[17]:/var/cache/logwatch fs:tmpfs nfs:No
get_mounts: dirs[18]:/var/lib/dhclient fs:tmpfs nfs:No
get_mounts: dirs[19]:/var/tmp fs:tmpfs nfs:No
get_mounts: dirs[20]:/media fs:tmpfs nfs:No
get_mounts: dirs[21]:/etc/adjtime fs:tmpfs nfs:No
get_mounts: dirs[22]:/etc/ntp.conf fs:tmpfs nfs:No
get_mounts: dirs[23]:/etc/resolv.conf fs:tmpfs nfs:No
get_mounts: dirs[24]:/etc/lvm/archive fs:tmpfs nfs:No
get_mounts: dirs[25]:/etc/lvm/backup fs:tmpfs nfs:No
get_mounts: dirs[26]:/var/account fs:tmpfs nfs:No
get_mounts: dirs[27]:/var/lib/iscsi fs:tmpfs nfs:No
get_mounts: dirs[28]:/var/lib/logrotate.status fs:tmpfs nfs:No
get_mounts: dirs[29]:/var/lib/ntp fs:tmpfs nfs:No
get_mounts: dirs[30]:/var/spool fs:tmpfs nfs:No
get_mounts: dirs[31]:/var/lib/sss fs:tmpfs nfs:No
get_mounts: dirs[32]:/etc/sysconfig/network-scripts fs:tmpfs nfs:No
get_mounts: dirs[33]:/var fs:ext4 nfs:No
get_mounts: already know dir[14]:/tmp
get_mounts: dirs[14]:/tmp fs:ext4 nfs:No
get_mounts: dirs[34]:/local fs:ext4 nfs:No
get_mounts: dirs[35]:/proc/sys/fs/binfmt_misc fs:binfmt_misc nfs:No
get_mounts: dirs[36]:/local/cgroup/cpuset fs:cgroup nfs:No
get_mounts: dirs[37]:/local/cgroup/cpu fs:cgroup nfs:No
get_mounts: dirs[38]:/local/cgroup/cpuacct fs:cgroup nfs:No
get_mounts: dirs[39]:/local/cgroup/memory fs:cgroup nfs:No
get_mounts: dirs[40]:/local/cgroup/devices fs:cgroup nfs:No
get_mounts: dirs[41]:/local/cgroup/freezer fs:cgroup nfs:No
get_mounts: dirs[42]:/local/cgroup/net_cls fs:cgroup nfs:No
get_mounts: dirs[43]:/local/cgroup/blkio fs:cgroup nfs:No
get_mounts: dirs[44]:/usr/pppl fs:nfs nfs:Yes
get_mounts: dirs[45]:/misc fs:autofs nfs:No
get_mounts: dirs[46]:/net fs:autofs nfs:No
get_mounts: dirs[47]:/v fs:autofs nfs:No
get_mounts: dirs[48]:/u fs:autofs nfs:No
get_mounts: dirs[49]:/w fs:autofs nfs:No
get_mounts: dirs[50]:/l fs:autofs nfs:No
get_mounts: dirs[51]:/p fs:autofs nfs:No
get_mounts: dirs[52]:/pfs fs:autofs nfs:No
get_mounts: dirs[53]:/proc/fs/nfsd fs:nfsd nfs:No
get_mounts: dirs[54]:/u/gtchilin fs:nfs nfs:Yes
get_mounts: dirs[55]:/u/ldelgado fs:nfs nfs:Yes
get_mounts: dirs[56]:/p/incoherent fs:nfs nfs:Yes
get_mounts

Re: [OMPI users] OpenMPI 2.1.0 build error: yes/lib: No such file or director

2017-04-26 Thread Prentice Bisbal

Edgar,

Thank you for the suggestion. That fixed this problem.

Prentice

On 04/26/2017 05:25 PM, Edgar Gabriel wrote:
Can you try to just skip the --with-lustre option? The option is really 
there to provide an alternative path if the lustre libraries are not 
installed in the default directories (e.g. --with-lustre=/opt/lustre/). 
There is obviously a bug in that configure did not recognize the missing 
argument. However, if the lustre libraries and headers are installed in 
the default location (i.e. /usr/), the configure logic will pick them up 
and compile the component even if you do not provide the --with-lustre 
argument.


Thanks

Edgar


On 4/26/2017 4:18 PM, Prentice Bisbal wrote:
I'm getting the following error when I build OpenMPI 2.1.0 with GCC 
5.4.0:


/bin/sh ../../../../libtool  --tag=CC   --mode=link gcc  -O3 -DNDEBUG
-finline-functions -fno-strict-aliasing -pthread -module -avoid-version
-Lyes/lib  -o libmca_fs_lustre.la  fs_lustre.lo fs_lustre_component.lo
fs_lustre_file_open.lo fs_lustre_file_close.lo fs_lustre_file_delete.lo
fs_lustre_file_sync.lo fs_lustre_file_set_size.lo
fs_lustre_file_get_size.lo -llustreapi  -lrt -lm -lutil
../../../../libtool: line 7489: cd: yes/lib: No such file or directory
libtool:   error: cannot determine absolute directory name of 'yes/lib'
make[2]: *** [libmca_fs_lustre.la] Error 1
make[2]: Leaving directory 
`/local/pbisbal/openmpi-2.1.0/ompi/mca/fs/lustre'

make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/local/pbisbal/openmpi-2.1.0/ompi'
make: *** [all-recursive] Error 1

Obviously, the problem is this argument to libtool in the above command:

-Lyes/lib

I've worked around this by going into ompi/mca/fs/lustre, running that
same libtool command but changing "-Lyes/lib" to "-L/lib", and then
resuming my build from the top level. I figured I'd report this error
here to see if this is a problem caused by me, or a bug in the configure
script.

When I do 'make check', I get another error caused by the same bad 
argument:


/bin/sh ../../libtool  --tag=CC   --mode=link gcc  -O3 -DNDEBUG
-finline-functions -fno-strict-aliasing -pthread
-L/usr/pppl/slurm/15.08.8/lib -Lyes/lib-Wl,-rpath
-Wl,/usr/pppl/slurm/15.08.8/lib -Wl,-rpath -Wl,yes/lib -Wl,-rpath
-Wl,/usr/pppl/gcc/5.4-pkgs/openmpi-2.1.0/lib -Wl,--enable-new-dtags  -o
external32 external32.o ../../ompi/libmpi.la ../../opal/libopen-pal.la
-lrt -lm -lutil
../../libtool: line 7489: cd: yes/lib: No such file or directory
libtool:   error: cannot determine absolute directory name of 'yes/lib'
make[3]: *** [external32] Error 1
make[3]: Leaving directory `/local/pbisbal/openmpi-2.1.0/test/datatype'
make[2]: *** [check-am] Error 2
make[2]: Leaving directory `/local/pbisbal/openmpi-2.1.0/test/datatype'
make[1]: *** [check-recursive] Error 1
make[1]: Leaving directory `/local/pbisbal/openmpi-2.1.0/test'
make: *** [check-recursive] Error 1

For reference, here is my configure command:

./configure \
--prefix=/usr/pppl/gcc/5.4-pkgs/openmpi-2.1.0 \
--disable-silent-rules \
--enable-mpi-fortran \
--enable-mpi-cxx \
--enable-shared \
--enable-static \
--enable-mpi-thread-multiple \
--with-cuda=/usr/pppl/cuda/cudatoolkit/6.5.14 \
--with-pmix \
--with-verbs \
--with-hwloc \
--with-pmi=/usr/pppl/slurm/15.08.8 \
--with-slurm \
--with-lustre \
--with-psm \
CC=gcc \
CXX=g++ \
FC=gfortran \
2>&1 | tee configure.log



___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


[OMPI users] OpenMPI 2.1.0: FAIL: opal_path_nfs

2017-04-26 Thread Prentice Bisbal
I'm trying to build OpenMPI 2.1.0 with GCC 5.4.0 on CentOS 6.8. After 
working around the '-Lyes/lib' errors I reported in my previous post, 
opal_path_nfs fails during 'make check' (see below). Is this failure 
critical, or is it something I can ignore and continue with my install? 
Googling only returned links to discussions of similar problems from 4-5 
years ago with earlier versions of OpenMPI.


STDOUT and STDERR from 'make check':

make  check-TESTS
make[3]: Entering directory `/local/pbisbal/openmpi-2.1.0/test/util'
make[4]: Entering directory `/local/pbisbal/openmpi-2.1.0/test/util'
PASS: opal_bit_ops
FAIL: opal_path_nfs

Testsuite summary for Open MPI 2.1.0

# TOTAL: 2
# PASS:  1
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0

See test/util/test-suite.log
Please report to http://www.open-mpi.org/community/help/


Contents of test/util/test-suite.log:

cat test/util/test-suite.log
==
   Open MPI 2.1.0: test/util/test-suite.log
==

# TOTAL: 2
# PASS:  1
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0

.. contents:: :depth: 2

FAIL: opal_path_nfs
===

Test usage: ./opal_path_nfs [DIR]
On Linux interprets output from mount(8) to check for nfs and verify 
opal_path_nfs()
Additionally, you may specify multiple DIR on the cmd-line, of which you 
the output

get_mounts: dirs[0]:/ fs:rootfs nfs:No
get_mounts: dirs[1]:/proc fs:proc nfs:No
get_mounts: dirs[2]:/sys fs:sysfs nfs:No
get_mounts: dirs[3]:/dev fs:devtmpfs nfs:No
get_mounts: dirs[4]:/dev/pts fs:devpts nfs:No
get_mounts: dirs[5]:/dev/shm fs:tmpfs nfs:No
get_mounts: already know dir[0]:/
get_mounts: dirs[0]:/ fs:nfs nfs:Yes
get_mounts: dirs[6]:/proc/bus/usb fs:usbfs nfs:No
get_mounts: dirs[7]:/var/lib/stateless/writable fs:tmpfs nfs:No
get_mounts: dirs[8]:/var/cache/man fs:tmpfs nfs:No
get_mounts: dirs[9]:/var/lock fs:tmpfs nfs:No
get_mounts: dirs[10]:/var/log fs:tmpfs nfs:No
get_mounts: dirs[11]:/var/run fs:tmpfs nfs:No
get_mounts: dirs[12]:/var/lib/dbus fs:tmpfs nfs:No
get_mounts: dirs[13]:/var/lib/nfs fs:tmpfs nfs:No
get_mounts: dirs[14]:/tmp fs:tmpfs nfs:No
get_mounts: dirs[15]:/var/cache/foomatic fs:tmpfs nfs:No
get_mounts: dirs[16]:/var/cache/hald fs:tmpfs nfs:No
get_mounts: dirs[17]:/var/cache/logwatch fs:tmpfs nfs:No
get_mounts: dirs[18]:/var/lib/dhclient fs:tmpfs nfs:No
get_mounts: dirs[19]:/var/tmp fs:tmpfs nfs:No
get_mounts: dirs[20]:/media fs:tmpfs nfs:No
get_mounts: dirs[21]:/etc/adjtime fs:tmpfs nfs:No
get_mounts: dirs[22]:/etc/ntp.conf fs:tmpfs nfs:No
get_mounts: dirs[23]:/etc/resolv.conf fs:tmpfs nfs:No
get_mounts: dirs[24]:/etc/lvm/archive fs:tmpfs nfs:No
get_mounts: dirs[25]:/etc/lvm/backup fs:tmpfs nfs:No
get_mounts: dirs[26]:/var/account fs:tmpfs nfs:No
get_mounts: dirs[27]:/var/lib/iscsi fs:tmpfs nfs:No
get_mounts: dirs[28]:/var/lib/logrotate.status fs:tmpfs nfs:No
get_mounts: dirs[29]:/var/lib/ntp fs:tmpfs nfs:No
get_mounts: dirs[30]:/var/spool fs:tmpfs nfs:No
get_mounts: dirs[31]:/var/lib/sss fs:tmpfs nfs:No
get_mounts: dirs[32]:/etc/sysconfig/network-scripts fs:tmpfs nfs:No
get_mounts: dirs[33]:/var fs:ext4 nfs:No
get_mounts: already know dir[14]:/tmp
get_mounts: dirs[14]:/tmp fs:ext4 nfs:No
get_mounts: dirs[34]:/local fs:ext4 nfs:No
get_mounts: dirs[35]:/proc/sys/fs/binfmt_misc fs:binfmt_misc nfs:No
get_mounts: dirs[36]:/local/cgroup/cpuset fs:cgroup nfs:No
get_mounts: dirs[37]:/local/cgroup/cpu fs:cgroup nfs:No
get_mounts: dirs[38]:/local/cgroup/cpuacct fs:cgroup nfs:No
get_mounts: dirs[39]:/local/cgroup/memory fs:cgroup nfs:No
get_mounts: dirs[40]:/local/cgroup/devices fs:cgroup nfs:No
get_mounts: dirs[41]:/local/cgroup/freezer fs:cgroup nfs:No
get_mounts: dirs[42]:/local/cgroup/net_cls fs:cgroup nfs:No
get_mounts: dirs[43]:/local/cgroup/blkio fs:cgroup nfs:No
get_mounts: dirs[44]:/usr/pppl fs:nfs nfs:Yes
get_mounts: dirs[45]:/misc fs:autofs nfs:No
get_mounts: dirs[46]:/net fs:autofs nfs:No
get_mounts: dirs[47]:/v fs:autofs nfs:No
get_mounts: dirs[48]:/u fs:autofs nfs:No
get_mounts: dirs[49]:/w fs:autofs nfs:No
get_mounts: dirs[50]:/l fs:autofs nfs:No
get_mounts: dirs[51]:/p fs:autofs nfs:No
get_mounts: dirs[52]:/pfs fs:autofs nfs:No
get_mounts: dirs[53]:/proc/fs/nfsd fs:nfsd nfs:No
get_mounts: dirs[54]:/u/gtchilin fs:nfs nfs:Yes
get_mounts: dirs[55]:/u/ldelgado fs:nfs nfs:Yes
get_mounts: dirs[56]:/p/incoherent fs:nfs nfs:Yes
get_mounts: dirs[57]:/u/bgriers fs:nfs nfs:Yes
get_mounts: dirs[58]:/p/beam fs:nfs nfs:Yes
get_mounts: dirs[59]:/u/ghao fs:nfs nfs:Yes
get_mounts: dirs[60]:/u/slazerso fs:nfs nfs:Yes
get_mounts: dirs[61]:/p/tsc fs:nfs nfs:Yes
get_mounts: dirs[62]:/p/stellopt fs:nfs nfs:Yes
get_mounts: 

[OMPI users] OpenMPI 2.1.0 build error: yes/lib: No such file or director

2017-04-26 Thread Prentice Bisbal

I'm getting the following error when I build OpenMPI 2.1.0 with GCC 5.4.0:

/bin/sh ../../../../libtool  --tag=CC   --mode=link gcc  -O3 -DNDEBUG 
-finline-functions -fno-strict-aliasing -pthread -module -avoid-version 
-Lyes/lib  -o libmca_fs_lustre.la  fs_lustre.lo fs_lustre_component.lo 
fs_lustre_file_open.lo fs_lustre_file_close.lo fs_lustre_file_delete.lo 
fs_lustre_file_sync.lo fs_lustre_file_set_size.lo 
fs_lustre_file_get_size.lo -llustreapi  -lrt -lm -lutil

../../../../libtool: line 7489: cd: yes/lib: No such file or directory
libtool:   error: cannot determine absolute directory name of 'yes/lib'
make[2]: *** [libmca_fs_lustre.la] Error 1
make[2]: Leaving directory `/local/pbisbal/openmpi-2.1.0/ompi/mca/fs/lustre'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/local/pbisbal/openmpi-2.1.0/ompi'
make: *** [all-recursive] Error 1

Obviously, the problem is this argument to libtool in the above command:

-Lyes/lib

I've worked around this by going into ompi/mca/fs/lustre, running that 
same libtool command but changing "-Lyes/lib" to "-L/lib", and then 
resuming my build from the top level. I figured I'd report this error 
here to see if this is a problem caused by me, or a bug in the configure 
script.


When I do 'make check', I get another error caused by the same bad argument:

/bin/sh ../../libtool  --tag=CC   --mode=link gcc  -O3 -DNDEBUG 
-finline-functions -fno-strict-aliasing -pthread 
-L/usr/pppl/slurm/15.08.8/lib -Lyes/lib-Wl,-rpath 
-Wl,/usr/pppl/slurm/15.08.8/lib -Wl,-rpath -Wl,yes/lib -Wl,-rpath 
-Wl,/usr/pppl/gcc/5.4-pkgs/openmpi-2.1.0/lib -Wl,--enable-new-dtags  -o 
external32 external32.o ../../ompi/libmpi.la ../../opal/libopen-pal.la 
-lrt -lm -lutil

../../libtool: line 7489: cd: yes/lib: No such file or directory
libtool:   error: cannot determine absolute directory name of 'yes/lib'
make[3]: *** [external32] Error 1
make[3]: Leaving directory `/local/pbisbal/openmpi-2.1.0/test/datatype'
make[2]: *** [check-am] Error 2
make[2]: Leaving directory `/local/pbisbal/openmpi-2.1.0/test/datatype'
make[1]: *** [check-recursive] Error 1
make[1]: Leaving directory `/local/pbisbal/openmpi-2.1.0/test'
make: *** [check-recursive] Error 1

For reference, here is my configure command:

./configure \
  --prefix=/usr/pppl/gcc/5.4-pkgs/openmpi-2.1.0 \
  --disable-silent-rules \
  --enable-mpi-fortran \
  --enable-mpi-cxx \
  --enable-shared \
  --enable-static \
  --enable-mpi-thread-multiple \
  --with-cuda=/usr/pppl/cuda/cudatoolkit/6.5.14 \
  --with-pmix \
  --with-verbs \
  --with-hwloc \
  --with-pmi=/usr/pppl/slurm/15.08.8 \
  --with-slurm \
  --with-lustre \
  --with-psm \
  CC=gcc \
  CXX=g++ \
  FC=gfortran \
  2>&1 | tee configure.log

--
Prentice

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] Compiler error with PGI: pgcc-Error-Unknown switch: -pthread

2017-04-26 Thread Prentice Bisbal

Everyone,

I just wanted to follow up on this, to help others, or possibly even a 
future me, having problems compiling OpenMPI with the PGI compilers. I 
did get it to work a few weeks ago, but I've been too busy to share my 
solution here. I need to give a shout out to Matt Thompson for 
providing the missing link: the siterc file (see 
https://www.mail-archive.com/users@lists.open-mpi.org/msg30918.html)


Here's what I did, step by step. This is on a CentOS 6.8 system:

1. Create a siterc file with the following contents in the bin directory 
where your pgcc, pgfortran, etc. live. For me, I installed PGI 17.3 in 
/usr/pppl/pgi/17.3, so this file is located at 
/usr/pppl/pgi/17.3/linux86-64/17.3/bin/siterc:


$ cat  /usr/pppl/pgi/17.3/linux86-64/17.3/bin/siterc
#
# siterc for gcc commands PGI does not support
#

switch -pthread is
 append(LDLIB1=-lpthread);




This will prevent this error:

pgcc-Error-Unknown switch: -pthread


2. During the configure step, specify -fPIC explicitly in CFLAGS, 
CXXFLAGS, and FCFLAGS. For some reason, this isn't added automatically 
for PGI, which leads to a linking failure deep into the build process. 
Here's my configure command:


./configure \
  --prefix=/usr/pppl/pgi/17.3-pkgs/openmpi-1.10.3 \
  --disable-silent-rules \
  --enable-shared \
  --enable-static \
  --enable-mpi-thread-multiple \
  --with-pmi=/usr/pppl/slurm/15.08.8 \
  --with-hwloc \
  --with-verbs \
  --with-slurm \
  --with-psm \
  CC=pgcc \
  CFLAGS="-fPIC -tp=x64 -fast" \
  CXX=pgc++ \
  CXXFLAGS="-fPIC -tp=x64 -fast" \
  FC=pgfortran \
  FCFLAGS="-fPIC -tp=x64 -fast" \
  2>&1 | tee configure.log

Obviously, you probably won't be specifying all the same options. The 
'-tp=x64' tells PGI to create a 'unified binary' that will run optimally 
on all the 64-bit x86 processors (according to PGI). Technically, I 
should be specifying '-fpic' instead of '-fPIC', but PGI accepts '-fPIC' 
for compatibility with other compilers, and I typed '-fPIC' out of habit.


That's it! Those two changes allowed me to build and install OpenMPI 
1.10.3 with PGI 17.3



Prentice

On 04/03/2017 10:20 AM, Prentice Bisbal wrote:
Greetings Open MPI users! After being off this list for several years, 
I'm back! And I need help:


I'm trying to compile OpenMPI 1.10.3 with the PGI compilers, version 
17.3. I'm using the following configure options:


./configure \
  --prefix=/usr/pppl/pgi/17.3-pkgs/openmpi-1.10.3 \
  --disable-silent-rules \
  --enable-shared \
  --enable-static \
  --enable-mpi-thread-multiple \
  --with-pmi=/usr/pppl/slurm/15.08.8 \
  --with-hwloc \
  --with-verbs \
  --with-slurm \
  --with-psm \
  CC=pgcc \
  CFLAGS="-tp x64 -fast" \
  CXX=pgc++ \
  CXXFLAGS="-tp x64 -fast" \
  FC=pgfortran \
  FCFLAGS="-tp x64 -fast" \
  2>&1 | tee configure.log

Which leads to this error  from libtool during make:

pgcc-Error-Unknown switch: -pthread

I've searched the archives, which ultimately led to this workaround 
from 2009:


https://www.open-mpi.org/community/lists/users/2009/04/8724.php

Interestingly, I participated in the discussion that led to that 
workaround, stating that I had no problem compiling Open MPI with PGI 
v9. I'm assuming the problem now is that I'm specifying 
--enable-mpi-thread-multiple, which I'm doing because a user requested 
that feature.


It's been exactly 8 years and 2 days since that workaround was posted 
to the list. Please tell me a better way of dealing with this issue 
than writing a 'fakepgf90' script. Any suggestions?





___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] Compiler error with PGI: pgcc-Error-Unknown switch: -pthread

2017-04-04 Thread Prentice Bisbal

Matt,

Thank you so much! I think you might have cracked the case for me. Yes, 
I'm on Linux, and I just looked up siterc and userrc files in the PGI 
userguide. I think I'm going to start with a userrc file, since I prefer 
to minimize customization as much as possible, and to test without 
affecting other users. I have run into other issues with PGI, too, after 
fixing the -pthread issue, which I'll bring up in a separate email.


Prentice

On 04/03/2017 06:24 PM, Matt Thompson wrote:
Coming in near the end here. I've had "fun" with PGI + Open MPI + 
macOS (and still haven't quite solved it, see: 
https://www.mail-archive.com/users@lists.open-mpi.org//msg30865.html, 
still unanswered!). The solution that PGI gave me, and which seems to be 
the magic sauce on macOS, is to use a siterc file 
(http://www.pgroup.com/userforum/viewtopic.php?p=21105#21105):


=
siterc for gcc commands PGI does not support
=
switch -ffast-math is hide;

switch -pipe is hide;

switch -fexpensive-optimizations is hide;

switch -pthread is
append(LDLIB1= -lpthread);

switch -qversion is
early
help(Display compiler version)
helpgroup(overall)
set(VERSION=YES);

switch -Wno-deprecated-declarations is hide;

switch -flat_namespace is hide;


If you use that, -pthread is "rerouted" to append -lpthread. You might 
try that and see if that helps. Since you are on Linux (I assume?), 
you should be able to proceed, as you shouldn't encounter the 
libtool bug/issue/*shrug* that is breaking macOS use.


On Mon, Apr 3, 2017 at 5:14 PM, Reuti <re...@staff.uni-marburg.de> wrote:


-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


    On 03.04.2017 at 23:07, Prentice Bisbal wrote:

> FYI - the proposed 'here-doc' solution below didn't work for me,
it produced an error. Neither did printf. When I used printf, only
the first arg was passed along:
>
> #!/bin/bash
>
> realcmd=/usr/pppl/pgi/17.3/linux86-64/17.3/bin/pgcc.real
> echo "original args: $@"
> newargs=$(printf -- "$@" | sed s/-pthread//g)

The format string is missing:

printf "%s " "$@"


> echo "new args: $newargs"
> #$realcmd $newargs
> exit
>
> $ pgcc -tp=x64 -fast conftest.c
> original args: -tp=x64 -fast conftest.c
> new args: -tp=x64
>
> Any ideas what I might be doing wrong here?
>
> So, my original echo "" "$@" solution works, and another
colleague also suggested this expressions, which appears to work, too:
>
> newargs=${@/-pthread/}
    >
    > Although I don't know how portable that is. I'm guessing that's
very bash-specific syntax.
>
> Prentice
>
> On 04/03/2017 04:26 PM, Prentice Bisbal wrote:
>> A coworker came up with another idea that works, too:
>>
>> newargs=$(sed s/-pthread//g <<EOF
>> $@
>> EOF
>> )
>>
>> That should work, too, but I haven't tested it.
>>
>> Prentice
    >>
    >> On 04/03/2017 04:11 PM, Andy Riebs wrote:
>>> Try
>>> $ printf -- "-E" ...
>>>
>>> On 04/03/2017 04:03 PM, Prentice Bisbal wrote:
>>>> Okay. the additional -E doesn't work,either. :(
>>>>
>>>> Prentice Bisbal Lead Software Engineer Princeton Plasma
Physics Laboratory http://www.pppl.gov
>>>> On 04/03/2017 04:01 PM, Prentice Bisbal wrote:
>>>>> Nevermind. A coworker helped me figure this one out. Echo is
treating the '-E' as an argument to echo and interpreting it
instead of passing it to sed. Since that's used by the configure
tests, that's a bit of a problem, Just adding another -E before
$@, should fix the problem.
>>>>>
>>>>> Prentice
>>>>>
>>>>> On 04/03/2017 03:54 PM, Prentice Bisbal wrote:
>>>>>> I've decided to work around this problem by creating a
wrapper script for pgcc that strips away the -pthread argument,
but my sed expression works on the command-line, but not in the
script. I'm essentially reproducing the workaround from
https://www.open-mpi.org/community/lists/users/2009/04/8724.php
<https://www.open-mpi.org/community/lists/users/2009/04/8724.php>.
>>>>>>
>>>>>> Can anyone see what's wrong with my implementation the
workaround? It's a very simple sed expression. Here's my script:
>>>>>>
>>>>>> #!/bin/bash
>>>>>>
>>>>>> realcmd=/path/to/pg

Re: [OMPI users] Compiler error with PGI: pgcc-Error-Unknown switch: -pthread

2017-04-03 Thread Prentice Bisbal
FYI - the proposed 'here-doc' solution below didn't work for me; it 
produced an error. Neither did printf. When I used printf, only the 
first arg was passed along:


#!/bin/bash

realcmd=/usr/pppl/pgi/17.3/linux86-64/17.3/bin/pgcc.real
echo "original args: $@"
newargs=$(printf -- "$@" | sed s/-pthread//g)
echo "new args: $newargs"
#$realcmd $newargs
exit

$ pgcc -tp=x64 -fast conftest.c
original args: -tp=x64 -fast conftest.c
new args: -tp=x64

Any ideas what I might be doing wrong here?

So, my original echo "" "$@" solution works, and another colleague also 
suggested this expression, which appears to work, too:


newargs=${@/-pthread/}

Although I don't know how portable that is. I'm guessing that's very 
bash-specific syntax.
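
Putting the pieces together, a wrapper along these lines seems to do the 
job (just a sketch; the .real path is from my install, and it assumes no 
argument contains whitespace):

#!/bin/bash
# strip -pthread before invoking the real pgcc
realcmd=/usr/pppl/pgi/17.3/linux86-64/17.3/bin/pgcc.real
newargs=$(printf '%s ' "$@" | sed 's/-pthread//g')
exec $realcmd $newargs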


Prentice

On 04/03/2017 04:26 PM, Prentice Bisbal wrote:

A coworker came up with another idea that works, too:

newargs=$(sed s/-pthread//g <<EOF
$@
EOF
)
Try
$ printf -- "-E" ...

On 04/03/2017 04:03 PM, Prentice Bisbal wrote:

Okay, the additional -E doesn't work, either. :(

Prentice Bisbal Lead Software Engineer Princeton Plasma Physics 
Laboratory http://www.pppl.gov

On 04/03/2017 04:01 PM, Prentice Bisbal wrote:
Nevermind. A coworker helped me figure this one out. Echo is 
treating the '-E' as an argument to echo and interpreting it 
instead of passing it to sed. Since that's used by the configure 
tests, that's a bit of a problem. Just adding another -E before $@ 
should fix the problem.


Prentice

On 04/03/2017 03:54 PM, Prentice Bisbal wrote:
I've decided to work around this problem by creating a wrapper 
script for pgcc that strips away the -pthread argument, but my sed 
expression works on the command-line, but not in the script. I'm 
essentially reproducing the workaround from 
https://www.open-mpi.org/community/lists/users/2009/04/8724.php.


Can anyone see what's wrong with my implementation of the workaround? 
It's a very simple sed expression. Here's my script:


#!/bin/bash

realcmd=/path/to/pgcc
echo "original args: $@"
newargs=$(echo "$@" | sed s/-pthread//)
echo "new args: $newargs"
#$realcmd $newargs
exit

And here's what happens when I run it:

 /path/to/pgcc -E conftest.c
original args: -E conftest.c
new args: conftest.c

As you can see, the -E argument is getting lost in translation. If 
I add more arguments, it works fine:


/path/to/pgcc -A -B -C -D -E conftest.c
original args: -A -B -C -D -E conftest.c
new args: -A -B -C -D -E conftest.c

It only seems to be a problem when -E is the first argument:

$ /path/to/pgcc -E -D -C -B -A conftest.c
original args: -E -D -C -B -A conftest.c
new args: -D -C -B -A conftest.c

Prentice

On 04/03/2017 02:24 PM, Aaron Knister wrote:
To be thorough, couldn't one replace -pthread in the slurm .la 
files with -lpthread? I ran into this last week and this was the 
solution I was thinking about implementing. Having said that, I 
can't think of a situation in which the -pthread/-lpthread 
argument would be required other than linking against statically 
compiled SLURM libraries, and even then I'm not so sure about that.
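
Something along these lines should do it (paths are placeholders, and 
I'd back up the .la files first):

perl -pi -e 's/-pthread\b/-lpthread/g' /path/to/slurm/lib/libpmi.la /path/to/slurm/lib/libslurm.la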


-Aaron

On 4/3/17 1:46 PM, Åke Sandgren wrote:
We build slurm with GCC, drop the -pthread arg in the .la files, and
have never seen any problems related to that. And we do build quite a
lot of code. And lots of versions of OpenMPI with multiple different
compilers (and versions).

On 04/03/2017 04:51 PM, Prentice Bisbal wrote:

This is the second suggestion to rebuild Slurm.

The other was from Åke Sandgren, who recommended this:


This usually comes from slurm, so we always do

perl -pi -e 's/-pthread//' /lap/slurm/${version}/lib/libpmi.la
/lap/slurm/${version}/lib/libslurm.la

when installing a new slurm version. Thus no need for a fakepg 
wrapper.


I don't really have the luxury to rebuild Slurm at the moment. 
How would
I rebuild Slurm to change this behavior? Is rebuilding Slurm 
with PGI
the only option to fix this in slurm, or use Åke's suggestion 
above?


If I did use Åke's suggestion above, how would that affect the 
operation
of Slurm, or future builds of OpenMPI and any other software 
that might
rely on Slurm, particularly with regards to building those apps 
with

non-PGI compilers?

Prentice

On 04/03/2017 10:31 AM, Gilles Gouaillardet wrote:

Hi,

The -pthread flag is likely pulled by libtool from the slurm 
libmpi.la

<http://libmpi.la> and/or libslurm.la <http://libslurm.la>
Workarounds are
- rebuild slurm with PGI
- remove the .la files (*.so and/or *.a are enough)
- wrap the PGI compiler to ignore the -pthread option

Hope this helps

Gilles

On Monday, April 3, 2017, Prentice Bisbal <pbis...@pppl.gov
<mailto:pbis...@pppl.gov>> wrote:

Greeting Open MPI users! After being off this list for 
several

years, I'm back! And I need help:

I'm trying to compile OpenMPI 1.10.3 with the PGI compilers,
version 17.3. I'm using the following configure options:

   

Re: [OMPI users] Compiler error with PGI: pgcc-Error-Unknown switch: -pthread

2017-04-03 Thread Prentice Bisbal

A coworker came up with another idea that works, too:

newargs=sed s/-pthread//g <
Try
$ printf -- "-E" ...

On 04/03/2017 04:03 PM, Prentice Bisbal wrote:

Okay, the additional -E doesn't work, either. :(

Prentice Bisbal Lead Software Engineer Princeton Plasma Physics 
Laboratory http://www.pppl.gov

On 04/03/2017 04:01 PM, Prentice Bisbal wrote:
Nevermind. A coworker helped me figure this one out. Echo is 
treating the '-E' as an option to echo and interpreting it instead 
of passing it through to sed. Since -E is used by the configure tests, 
that's a bit of a problem. Just adding another -E before $@ should 
fix the problem.


Prentice

On 04/03/2017 03:54 PM, Prentice Bisbal wrote:
I've decided to work around this problem by creating a wrapper 
script for pgcc that strips away the -pthread argument, but my sed 
expression works on the command-line, but not in the script. I'm 
essentially reproducing the workaround from 
https://www.open-mpi.org/community/lists/users/2009/04/8724.php.


Can anyone see what's wrong with my implementation of the workaround? 
It's a very simple sed expression. Here's my script:


#!/bin/bash

realcmd=/path/to/pgcc
echo "original args: $@"
newargs=$(echo "$@" | sed s/-pthread//)
echo "new args: $newargs"
#$realcmd $newargs
exit

And here's what happens when I run it:

 /path/to/pgcc -E conftest.c
original args: -E conftest.c
new args: conftest.c

As you can see, the -E argument is getting lost in translation. If 
I add more arguments, it works fine:


/path/to/pgcc -A -B -C -D -E conftest.c
original args: -A -B -C -D -E conftest.c
new args: -A -B -C -D -E conftest.c

It only seems to be a problem when -E is the first argument:

$ /path/to/pgcc -E -D -C -B -A conftest.c
original args: -E -D -C -B -A conftest.c
new args: -D -C -B -A conftest.c

Prentice

On 04/03/2017 02:24 PM, Aaron Knister wrote:
To be thorough couldn't one replace -pthread in the slurm .la 
files with -lpthread? I ran into this last week and this was the 
solution I was thinking about implementing. Having said that, I 
can't think of a situation in which the -pthread/-lpthread 
argument would be required other than linking against statically 
compiled SLURM libraries and even then I'm not so sure about that.


-Aaron

On 4/3/17 1:46 PM, Åke Sandgren wrote:

We build slurm with GCC, drop the -pthread arg in the .la files, and
have never seen any problems related to that. And we do build 
quite a

lot of code. And lots of versions of OpenMPI with multiple different
compilers (and versions).

On 04/03/2017 04:51 PM, Prentice Bisbal wrote:

This is the second suggestion to rebuild Slurm

The other from Åke Sandgren, who recommended this:


This usually comes from slurm, so we always do

perl -pi -e 's/-pthread//' /lap/slurm/${version}/lib/libpmi.la
/lap/slurm/${version}/lib/libslurm.la

when installing a new slurm version. Thus no need for a fakepg 
wrapper.


I don't really have the luxury to rebuild Slurm at the moment. 
How would
I rebuild Slurm to change this behavior? Is rebuilding Slurm 
with PGI
the only option to fix this in slurm, or use Åke's suggestion 
above?


If I did use Åke's suggestion above, how would that affect the 
operation
of Slurm, or future builds of OpenMPI and any other software 
that might

rely on Slurm, particularly with regards to building those apps with
non-PGI compilers?

Prentice

On 04/03/2017 10:31 AM, Gilles Gouaillardet wrote:

Hi,

The -pthread flag is likely pulled by libtool from the slurm 
libmpi.la

<http://libmpi.la> and/or libslurm.la <http://libslurm.la>
Workarounds are
- rebuild slurm with PGI
- remove the .la files (*.so and/or *.a are enough)
- wrap the PGI compiler to ignore the -pthread option

Hope this helps

Gilles

On Monday, April 3, 2017, Prentice Bisbal <pbis...@pppl.gov
<mailto:pbis...@pppl.gov>> wrote:

Greeting Open MPI users! After being off this list for several
years, I'm back! And I need help:

I'm trying to compile OpenMPI 1.10.3 with the PGI compilers,
version 17.3. I'm using the following configure options:

./configure \
--prefix=/usr/pppl/pgi/17.3-pkgs/openmpi-1.10.3 \
  --disable-silent-rules \
  --enable-shared \
  --enable-static \
  --enable-mpi-thread-multiple \
  --with-pmi=/usr/pppl/slurm/15.08.8 \
  --with-hwloc \
  --with-verbs \
  --with-slurm \
  --with-psm \
  CC=pgcc \
  CFLAGS="-tp x64 -fast" \
  CXX=pgc++ \
  CXXFLAGS="-tp x64 -fast" \
  FC=pgfortran \
  FCFLAGS="-tp x64 -fast" \
  2>&1 | tee configure.log

Which leads to this error  from libtool during make:

pgcc-Error-Unknown switch: -pthread

I've searched the archives, which ultimately lead to this work
around from 2009:

https://www.open-mpi.org/community/lists/users/2009/04/8724.php
<https://www.open-mpi.org/community/lists/users/2009/04/8724.php>

Interestingl

Re: [OMPI users] Compiler error with PGI: pgcc-Error-Unknown switch: -pthread

2017-04-03 Thread Prentice Bisbal

Okay, the additional -E doesn't work, either. :(

Prentice Bisbal Lead Software Engineer Princeton Plasma Physics 
Laboratory http://www.pppl.gov

On 04/03/2017 04:01 PM, Prentice Bisbal wrote:
Nevermind. A coworker helped me figure this one out. Echo is treating 
the '-E' as an option to echo and interpreting it instead of passing 
it through to sed. Since -E is used by the configure tests, that's a bit of a 
problem. Just adding another -E before $@ should fix the problem.


Prentice

On 04/03/2017 03:54 PM, Prentice Bisbal wrote:
I've decided to work around this problem by creating a wrapper script 
for pgcc that strips away the -pthread argument, but my sed 
expression works on the command-line, but not in the script. I'm 
essentially reproducing the workaround from 
https://www.open-mpi.org/community/lists/users/2009/04/8724.php.


Can anyone see what's wrong with my implementation of the workaround? 
It's a very simple sed expression. Here's my script:


#!/bin/bash

realcmd=/path/to/pgcc
echo "original args: $@"
newargs=$(echo "$@" | sed s/-pthread//)
echo "new args: $newargs"
#$realcmd $newargs
exit

And here's what happens when I run it:

 /path/to/pgcc -E conftest.c
original args: -E conftest.c
new args: conftest.c

As you can see, the -E argument is getting lost in translation. If I 
add more arguments, it works fine:


/path/to/pgcc -A -B -C -D -E conftest.c
original args: -A -B -C -D -E conftest.c
new args: -A -B -C -D -E conftest.c

It only seems to be a problem when -E is the first argument:

$ /path/to/pgcc -E -D -C -B -A conftest.c
original args: -E -D -C -B -A conftest.c
new args: -D -C -B -A conftest.c

Prentice

On 04/03/2017 02:24 PM, Aaron Knister wrote:
To be thorough couldn't one replace -pthread in the slurm .la files 
with -lpthread? I ran into this last week and this was the solution 
I was thinking about implementing. Having said that, I can't think 
of a situation in which the -pthread/-lpthread argument would be 
required other than linking against statically compiled SLURM 
libraries and even then I'm not so sure about that.


-Aaron

On 4/3/17 1:46 PM, Åke Sandgren wrote:

We build slurm with GCC, drop the -pthread arg in the .la files, and
have never seen any problems related to that. And we do build quite a
lot of code. And lots of versions of OpenMPI with multiple different
compilers (and versions).

On 04/03/2017 04:51 PM, Prentice Bisbal wrote:

This is the second suggestion to rebuild Slurm

The  other from Åke Sandgren, who recommended this:


This usually comes from slurm, so we always do

perl -pi -e 's/-pthread//' /lap/slurm/${version}/lib/libpmi.la
/lap/slurm/${version}/lib/libslurm.la

when installing a new slurm version. Thus no need for a fakepg 
wrapper.


I don't really have the luxury to rebuild Slurm at the moment. How 
would

I rebuild Slurm to change this behavior? Is rebuilding Slurm with PGI
the only option to fix this in slurm, or use Åke's suggestion above?

If I did use Åke's suggestion above, how would that affect the 
operation
of Slurm, or future builds of OpenMPI and any other software that 
might

rely on Slurm, particularly with regards to building those apps with
non-PGI compilers?

Prentice

On 04/03/2017 10:31 AM, Gilles Gouaillardet wrote:

Hi,

The -pthread flag is likely pulled by libtool from the slurm 
libmpi.la

<http://libmpi.la> and/or libslurm.la <http://libslurm.la>
Workarounds are
- rebuild slurm with PGI
- remove the .la files (*.so and/or *.a are enough)
- wrap the PGI compiler to ignore the -pthread option

Hope this helps

Gilles

On Monday, April 3, 2017, Prentice Bisbal <pbis...@pppl.gov
<mailto:pbis...@pppl.gov>> wrote:

Greeting Open MPI users! After being off this list for several
years, I'm back! And I need help:

I'm trying to compile OpenMPI 1.10.3 with the PGI compilers,
version 17.3. I'm using the following configure options:

./configure \
  --prefix=/usr/pppl/pgi/17.3-pkgs/openmpi-1.10.3 \
  --disable-silent-rules \
  --enable-shared \
  --enable-static \
  --enable-mpi-thread-multiple \
  --with-pmi=/usr/pppl/slurm/15.08.8 \
  --with-hwloc \
  --with-verbs \
  --with-slurm \
  --with-psm \
  CC=pgcc \
  CFLAGS="-tp x64 -fast" \
  CXX=pgc++ \
  CXXFLAGS="-tp x64 -fast" \
  FC=pgfortran \
  FCFLAGS="-tp x64 -fast" \
  2>&1 | tee configure.log

Which leads to this error  from libtool during make:

pgcc-Error-Unknown switch: -pthread

I've searched the archives, which ultimately lead to this work
around from 2009:

https://www.open-mpi.org/community/lists/users/2009/04/8724.php
<https://www.open-mpi.org/community/lists/users/2009/04/8724.php>

Interestingly, I participated in the discussion that lead to 
that
workaround, stating that I had no problem compiling Open MPI 
with

PGI v9. I'm assuming the pr

Re: [OMPI users] Compiler error with PGI: pgcc-Error-Unknown switch: -pthread

2017-04-03 Thread Prentice Bisbal
Nevermind. A coworker helped me figure this one out. Echo is treating 
the '-E' as an option to echo and interpreting it instead of passing 
it through to sed. Since -E is used by the configure tests, that's a bit of a 
problem. Just adding another -E before $@ should fix the problem.
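
The behavior is easy to reproduce with the bash builtin echo (a quick
sketch, not taken from the original messages):

echo -E conftest.c               # prints: conftest.c   (-E is consumed as an echo option)
echo "" -E conftest.c            # prints:  -E conftest.c  (a leading non-option word protects -E)
printf -- '%s ' -E conftest.c    # prints: -E conftest.c   (printf stops option parsing at --)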


Prentice

On 04/03/2017 03:54 PM, Prentice Bisbal wrote:
I've decided to work around this problem by creating a wrapper script 
for pgcc that strips away the -pthread argument, but my sed expression 
works on the command-line, but not in the script. I'm essentially 
reproducing the workaround from 
https://www.open-mpi.org/community/lists/users/2009/04/8724.php.


Can anyone see what's wrong with my implementation of the workaround? 
It's a very simple sed expression. Here's my script:


#!/bin/bash

realcmd=/path/to/pgcc
echo "original args: $@"
newargs=$(echo "$@" | sed s/-pthread//)
echo "new args: $newargs"
#$realcmd $newargs
exit

And here's what happens when I run it:

 /path/to/pgcc -E conftest.c
original args: -E conftest.c
new args: conftest.c

As you can see, the -E argument is getting lost in translation. If I 
add more arguments, it works fine:


/path/to/pgcc -A -B -C -D -E conftest.c
original args: -A -B -C -D -E conftest.c
new args: -A -B -C -D -E conftest.c

It only seems to be a problem when -E is the first argument:

$ /path/to/pgcc -E -D -C -B -A conftest.c
original args: -E -D -C -B -A conftest.c
new args: -D -C -B -A conftest.c

Prentice

On 04/03/2017 02:24 PM, Aaron Knister wrote:
To be thorough couldn't one replace -pthread in the slurm .la files 
with -lpthread? I ran into this last week and this was the solution I 
was thinking about implementing. Having said that, I can't think of a 
situation in which the -pthread/-lpthread argument would be required 
other than linking against statically compiled SLURM libraries and 
even then I'm not so sure about that.


-Aaron

On 4/3/17 1:46 PM, Åke Sandgren wrote:

We build slurm with GCC, drop the -pthread arg in the .la files, and
have never seen any problems related to that. And we do build quite a
lot of code. And lots of versions of OpenMPI with multiple different
compilers (and versions).

On 04/03/2017 04:51 PM, Prentice Bisbal wrote:

This is the second suggestion to rebuild Slurm

The  other from Åke Sandgren, who recommended this:


This usually comes from slurm, so we always do

perl -pi -e 's/-pthread//' /lap/slurm/${version}/lib/libpmi.la
/lap/slurm/${version}/lib/libslurm.la

when installing a new slurm version. Thus no need for a fakepg 
wrapper.


I don't really have the luxury to rebuild Slurm at the moment. How 
would

I rebuild Slurm to change this behavior? Is rebuilding Slurm with PGI
the only option to fix this in slurm, or use Åke's suggestion above?

If I did use Åke's suggestion above, how would that affect the 
operation
of Slurm, or future builds of OpenMPI and any other software that 
might

rely on Slurm, particularly with regards to building those apps with
non-PGI compilers?

Prentice

On 04/03/2017 10:31 AM, Gilles Gouaillardet wrote:

Hi,

The -pthread flag is likely pulled by libtool from the slurm 
libmpi.la

<http://libmpi.la> and/or libslurm.la <http://libslurm.la>
Workarounds are
- rebuild slurm with PGI
- remove the .la files (*.so and/or *.a are enough)
- wrap the PGI compiler to ignore the -pthread option

Hope this helps

Gilles

On Monday, April 3, 2017, Prentice Bisbal <pbis...@pppl.gov
<mailto:pbis...@pppl.gov>> wrote:

Greeting Open MPI users! After being off this list for several
years, I'm back! And I need help:

I'm trying to compile OpenMPI 1.10.3 with the PGI compilers,
version 17.3. I'm using the following configure options:

./configure \
  --prefix=/usr/pppl/pgi/17.3-pkgs/openmpi-1.10.3 \
  --disable-silent-rules \
  --enable-shared \
  --enable-static \
  --enable-mpi-thread-multiple \
  --with-pmi=/usr/pppl/slurm/15.08.8 \
  --with-hwloc \
  --with-verbs \
  --with-slurm \
  --with-psm \
  CC=pgcc \
  CFLAGS="-tp x64 -fast" \
  CXX=pgc++ \
  CXXFLAGS="-tp x64 -fast" \
  FC=pgfortran \
  FCFLAGS="-tp x64 -fast" \
  2>&1 | tee configure.log

Which leads to this error  from libtool during make:

pgcc-Error-Unknown switch: -pthread

I've searched the archives, which ultimately lead to this work
around from 2009:

https://www.open-mpi.org/community/lists/users/2009/04/8724.php
<https://www.open-mpi.org/community/lists/users/2009/04/8724.php>

Interestingly, I participated in the discussion that lead to that
workaround, stating that I had no problem compiling Open MPI with
PGI v9. I'm assuming the problem now is that I'm specifying
--enable-mpi-thread-multiple, which I'm doing because a user
requested that feature.

It's been exactly 8 years and 2 days since that workaround was
posted 

Re: [OMPI users] Compiler error with PGI: pgcc-Error-Unknown switch: -pthread

2017-04-03 Thread Prentice Bisbal
I've decided to work around this problem by creating a wrapper script 
for pgcc that strips away the -pthread argument, but my sed expression 
works on the command-line, but not in the script. I'm essentially 
reproducing the workaround from 
https://www.open-mpi.org/community/lists/users/2009/04/8724.php.


Can anyone see what's wrong with my implementation of the workaround? It's 
a very simple sed expression. Here's my script:


#!/bin/bash

realcmd=/path/to/pgcc
echo "original args: $@"
newargs=$(echo "$@" | sed s/-pthread//)
echo "new args: $newargs"
#$realcmd $newargs
exit

And here's what happens when I run it:

 /path/to/pgcc -E conftest.c
original args: -E conftest.c
new args: conftest.c

As you can see, the -E argument is getting lost in translation. If I add 
more arguments, it works fine:


/path/to/pgcc -A -B -C -D -E conftest.c
original args: -A -B -C -D -E conftest.c
new args: -A -B -C -D -E conftest.c

It only seems to be a problem when -E is the first argument:

$ /path/to/pgcc -E -D -C -B -A conftest.c
original args: -E -D -C -B -A conftest.c
new args: -D -C -B -A conftest.c

Prentice

On 04/03/2017 02:24 PM, Aaron Knister wrote:
To be thorough couldn't one replace -pthread in the slurm .la files 
with -lpthread? I ran into this last week and this was the solution I 
was thinking about implementing. Having said that, I can't think of a 
situation in which the -pthread/-lpthread argument would be required 
other than linking against statically compiled SLURM libraries and 
even then I'm not so sure about that.


-Aaron

On 4/3/17 1:46 PM, Åke Sandgren wrote:

We build slurm with GCC, drop the -pthread arg in the .la files, and
have never seen any problems related to that. And we do build quite a
lot of code. And lots of versions of OpenMPI with multiple different
compilers (and versions).

On 04/03/2017 04:51 PM, Prentice Bisbal wrote:

This is the second suggestion to rebuild Slurm

The  other from Åke Sandgren, who recommended this:


This usually comes from slurm, so we always do

perl -pi -e 's/-pthread//' /lap/slurm/${version}/lib/libpmi.la
/lap/slurm/${version}/lib/libslurm.la

when installing a new slurm version. Thus no need for a fakepg 
wrapper.


I don't really have the luxury to rebuild Slurm at the moment. How 
would

I rebuild Slurm to change this behavior? Is rebuilding Slurm with PGI
the only option to fix this in slurm, or use Åke's suggestion above?

If I did use Åke's suggestion above, how would that affect the 
operation

of Slurm, or future builds of OpenMPI and any other software that might
rely on Slurm, particularly with regards to building those apps with
non-PGI compilers?

Prentice

On 04/03/2017 10:31 AM, Gilles Gouaillardet wrote:

Hi,

The -pthread flag is likely pulled by libtool from the slurm libmpi.la
<http://libmpi.la> and/or libslurm.la <http://libslurm.la>
Workarounds are
- rebuild slurm with PGI
- remove the .la files (*.so and/or *.a are enough)
- wrap the PGI compiler to ignore the -pthread option

Hope this helps

Gilles

On Monday, April 3, 2017, Prentice Bisbal <pbis...@pppl.gov
<mailto:pbis...@pppl.gov>> wrote:

Greeting Open MPI users! After being off this list for several
years, I'm back! And I need help:

I'm trying to compile OpenMPI 1.10.3 with the PGI compilers,
version 17.3. I'm using the following configure options:

./configure \
  --prefix=/usr/pppl/pgi/17.3-pkgs/openmpi-1.10.3 \
  --disable-silent-rules \
  --enable-shared \
  --enable-static \
  --enable-mpi-thread-multiple \
  --with-pmi=/usr/pppl/slurm/15.08.8 \
  --with-hwloc \
  --with-verbs \
  --with-slurm \
  --with-psm \
  CC=pgcc \
  CFLAGS="-tp x64 -fast" \
  CXX=pgc++ \
  CXXFLAGS="-tp x64 -fast" \
  FC=pgfortran \
  FCFLAGS="-tp x64 -fast" \
  2>&1 | tee configure.log

Which leads to this error  from libtool during make:

pgcc-Error-Unknown switch: -pthread

I've searched the archives, which ultimately lead to this work
around from 2009:

https://www.open-mpi.org/community/lists/users/2009/04/8724.php
<https://www.open-mpi.org/community/lists/users/2009/04/8724.php>

Interestingly, I participated in the discussion that lead to that
workaround, stating that I had no problem compiling Open MPI with
PGI v9. I'm assuming the problem now is that I'm specifying
--enable-mpi-thread-multiple, which I'm doing because a user
requested that feature.

It's been exactly 8 years and 2 days since that workaround was
posted to the list. Please tell me a better way of dealing with
this issue than writing a 'fakepgf90' script. Any suggestions?


--
Prentice

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
<https://rfd.newmexicoconsortium.org

Re: [OMPI users] Compiler error with PGI: pgcc-Error-Unknown switch: -pthread

2017-04-03 Thread Prentice Bisbal

This is the second suggestion to rebuild Slurm

The  other from Åke Sandgren, who recommended this:


This usually comes from slurm, so we always do

perl -pi -e 's/-pthread//' /lap/slurm/${version}/lib/libpmi.la
/lap/slurm/${version}/lib/libslurm.la

when installing a new slurm version. Thus no need for a fakepg wrapper.


I don't really have the luxury to rebuild Slurm at the moment. How would 
I rebuild Slurm to change this behavior? Is rebuilding Slurm with PGI 
the only option to fix this in slurm, or use Åke's suggestion above?


If I did use Åke's suggestion above, how would that affect the operation 
of Slurm, or future builds of OpenMPI and any other software that might 
rely on Slurm, particularly with regards to building those apps 
non-PGI compilers?


Prentice

On 04/03/2017 10:31 AM, Gilles Gouaillardet wrote:

Hi,

The -pthread flag is likely pulled by libtool from the slurm libmpi.la 
<http://libmpi.la> and/or libslurm.la <http://libslurm.la>

Workarounds are
- rebuild slurm with PGI
- remove the .la files (*.so and/or *.a are enough)
- wrap the PGI compiler to ignore the -pthread option

Hope this helps

Gilles

On Monday, April 3, 2017, Prentice Bisbal <pbis...@pppl.gov 
<mailto:pbis...@pppl.gov>> wrote:


Greeting Open MPI users! After being off this list for several
years, I'm back! And I need help:

I'm trying to compile OpenMPI 1.10.3 with the PGI compilers,
version 17.3. I'm using the following configure options:

./configure \
  --prefix=/usr/pppl/pgi/17.3-pkgs/openmpi-1.10.3 \
  --disable-silent-rules \
  --enable-shared \
  --enable-static \
  --enable-mpi-thread-multiple \
  --with-pmi=/usr/pppl/slurm/15.08.8 \
  --with-hwloc \
  --with-verbs \
  --with-slurm \
  --with-psm \
  CC=pgcc \
  CFLAGS="-tp x64 -fast" \
  CXX=pgc++ \
  CXXFLAGS="-tp x64 -fast" \
  FC=pgfortran \
  FCFLAGS="-tp x64 -fast" \
  2>&1 | tee configure.log

Which leads to this error  from libtool during make:

pgcc-Error-Unknown switch: -pthread

I've searched the archives, which ultimately lead to this work
around from 2009:

https://www.open-mpi.org/community/lists/users/2009/04/8724.php
<https://www.open-mpi.org/community/lists/users/2009/04/8724.php>

Interestingly, I participated in the discussion that lead to that
workaround, stating that I had no problem compiling Open MPI with
PGI v9. I'm assuming the problem now is that I'm specifying
--enable-mpi-thread-multiple, which I'm doing because a user
requested that feature.

It's been exactly 8 years and 2 days since that workaround was
posted to the list. Please tell me a better way of dealing with
this issue than writing a 'fakepgf90' script. Any suggestions?


-- 
Prentice


___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
<https://rfd.newmexicoconsortium.org/mailman/listinfo/users>



___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

[OMPI users] Compiler error with PGI: pgcc-Error-Unknown switch: -pthread

2017-04-03 Thread Prentice Bisbal
Greetings, Open MPI users! After being off this list for several years, 
I'm back! And I need help:


I'm trying to compile OpenMPI 1.10.3 with the PGI compilers, version 
17.3. I'm using the following configure options:


./configure \
  --prefix=/usr/pppl/pgi/17.3-pkgs/openmpi-1.10.3 \
  --disable-silent-rules \
  --enable-shared \
  --enable-static \
  --enable-mpi-thread-multiple \
  --with-pmi=/usr/pppl/slurm/15.08.8 \
  --with-hwloc \
  --with-verbs \
  --with-slurm \
  --with-psm \
  CC=pgcc \
  CFLAGS="-tp x64 -fast" \
  CXX=pgc++ \
  CXXFLAGS="-tp x64 -fast" \
  FC=pgfortran \
  FCFLAGS="-tp x64 -fast" \
  2>&1 | tee configure.log

Which leads to this error  from libtool during make:

pgcc-Error-Unknown switch: -pthread

I've searched the archives, which ultimately led to this workaround 
from 2009:


https://www.open-mpi.org/community/lists/users/2009/04/8724.php

Interestingly, I participated in the discussion that led to that 
workaround, stating that I had no problem compiling Open MPI with PGI 
v9. I'm assuming the problem now is that I'm specifying 
--enable-mpi-thread-multiple, which I'm doing because a user requested 
that feature.


It's been exactly 8 years and 2 days since that workaround was posted to 
the list. Please tell me a better way of dealing with this issue than 
writing a 'fakepgf90' script. Any suggestions?



--
Prentice

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] Building openmpi from src rpm: rpmbuild --rebuild errors with 'cpio: MD5 sum mismatch' (since openmpi 1.4.5)

2012-06-06 Thread Prentice Bisbal
On 05/31/2012 07:26 AM, Jeff Squyres wrote:
> On May 31, 2012, at 2:04 AM, livelfs wrote:
>
>> Since 1.4.5 openmpi release, it is no longer possible to build openmpi 
>> binary with rpmbuild --rebuild if system rpm package version is 4.4.x, like 
>> in SLES10, SLES11, RHEL/CentOS 5.x.
>>
>> For instance, on CentOS 5.8 x86_64 with rpm 4.4.2.3-28.el5_8:
>>
>> [root@horizon _tmp]# rpmbuild --rebuild openmpi-1.4.5-1.src.rpm
>> Installing openmpi-1.4.5-1.src.rpm
>> warning: user jsquyres does not exist - using root
>> error: unpacking of archive failed on file 
>> /usr/src/redhat/SPECS/openmpi-1.4.5.spec;4fc65c74: cpio: MD5 sum mismatch
>> error: openmpi-1.4.5-1.src.rpm cannot be installed
>>
>> Apparently this problem is due to lack of support of SHA-256 in rpm 4.4.x
> Mmmm.  I wonder if this corresponds to me upgrading my cluster (where I make 
> the SRPM) from RHEL5 to RHEL6.  I'll bet it does.  :-\
>
> Just curious -- do you know if there's a way I can make an RHEL5-friendly 
> SRPM on my RHEL6 cluster?  I seem to have RPM 4.8.0 on my RHEL6 machines.
>
> Or, better yet, perhaps I should be producing the SRPM on the official OMPI 
> build machine (i.e., where we make our tarballs), which is still back at 
> RHEL4.  I'm not quite sure how it evolved that we make tarballs in tightly 
> controlled conditions, but the SRPM is just made by hand on my cluster (which 
> is subject to upgrades, etc.).  Hrm. :-\
>

Building on RHEL 4 shouldn't have any impact. If anything, it would make
things worse instead of better, but I think that's unlikely. This
problem has to do with changes in RPM itself from RHEL 5 to RHEL 6.
Ideally, you should be using Mock to build your RPMs, and build a
separate set of RPMs for RHEL 3,4,5,6,... It's a PITA, I know, but it's
really the best way to build RPMs without any dependency gotchas.
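
For example, something along these lines (the epel-5-x86_64 config name is
just an assumption about the local mock setup):

mock -r epel-5-x86_64 --rebuild openmpi-1.4.5-1.src.rpm
# builds the packages inside a clean RHEL 5 chroot, so the result is made
# with that release's rpm and its default (MD5) file digests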

--
Prentice



Re: [OMPI users] Building openmpi from src rpm: rpmbuild --rebuild errors with 'cpio: MD5 sum mismatch' (since openmpi 1.4.5)

2012-06-06 Thread Prentice Bisbal
On 05/31/2012 02:04 AM, livelfs wrote:
> Hi
> Since 1.4.5 openmpi release, it is no longer possible to build openmpi
> binary with rpmbuild --rebuild if system rpm package version is 4.4.x,
> like in SLES10, SLES11, RHEL/CentOS 5.x.
>
> For instance, on CentOS 5.8 x86_64 with rpm 4.4.2.3-28.el5_8:
>
> [root@horizon _tmp]# rpmbuild --rebuild openmpi-1.4.5-1.src.rpm
> Installing openmpi-1.4.5-1.src.rpm
> warning: user jsquyres does not exist - using root
> error: unpacking of archive failed on file
> /usr/src/redhat/SPECS/openmpi-1.4.5.spec;4fc65c74: cpio: MD5 sum mismatch
> error: openmpi-1.4.5-1.src.rpm cannot be installed
>
> Apparently this problem is due to lack of support of SHA-256 in rpm 4.4.x
>
> Googling suggests
>   rpmbuild -bs \
>--define "_source_filedigest_algorithm md5" \
>--define "_binary_filedigest_algorithm md5" \
>package.spec
> should be used to produce openmpi src rpms and avoid the problem.
>
> Please note that
> - rpmbuild works OK on RHEL/CentOS 5.x with openmpi-1.4.4-1.src.rpm
> and all previous versions
> - rpmbuild works OK on with all openmpi versions with rpm 4.8.x from
> RHEL/CentOS 6.x
> - this is of course not blocking, since I successfully tested 2
> workarounds
> 1) install package with --nomd5, then rpmbuild -ba 
> 2) repackage with "old" rpm:
> rpm2cpio to extract spec file + sources tar
> rpmbuild -bs  to produce new src rpm
> Then rpmbuild --rebuild is OK
>
>

This is a known "problem" with RHEL 6 that burned me, too. I say
"problem" in quotes because in my case, it only appeared when I tried to
install RPMs built for RHEL 5 on a RHEL 6 system. That's a problem to
me, but some purists don't see this as a problem and just say "Well,
that's what you get for trying to install RHEL 5 RPMs on a RHEL 6
system." I don't agree with them.

As a workaround, I think I did some magic with rpm2cpio, as documented
above, but I don't remember the details.
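
The repackaging workaround quoted above boils down to something like this
(the SRPM and spec names follow the thread; the exact rpmbuild macros are an
assumption):

mkdir repack && cd repack
rpm2cpio ../openmpi-1.4.5-1.src.rpm | cpio -idmv     # extract the spec file and source tarball
rpmbuild -bs --define "_sourcedir $PWD" --define "_specdir $PWD" openmpi-1.4.5.spec
# the SRPM written by the local (older) rpm can then be rebuilt as usual
# with rpmbuild --rebuild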

--
Prentice


Re: [OMPI users] regarding the problem occurred while running anmpi programs

2012-04-26 Thread Prentice Bisbal
Actually, he should leave the ":$LD_LIBRARY_PATH" on the end. That way
if LD_LIBRARY_PATH is already defined, the Open MPI directory is just
prepended to LD_LIBRARY_PATH. Omitting ":$LD_LIBRARY_PATH" from his
command could cause other needed elements of LD_LIBRARY_PATH to be lost,
causing other runtime errors.
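
In other words (using the install path from the message quoted below):

export LD_LIBRARY_PATH=/usr/local/openmpi-1.4.5/lib:$LD_LIBRARY_PATH   # prepends, keeps existing entries
# versus overwriting, which silently drops everything that was already there:
# export LD_LIBRARY_PATH=/usr/local/openmpi-1.4.5/lib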

--
Prentice



On 04/25/2012 11:48 AM, tyler.bal...@huskers.unl.edu wrote:
> export LD_LIBRARY_PATH= [location of library] leave out
> the :$LD_LIBRARY_PATH 
> 
> *From:* users-boun...@open-mpi.org [users-boun...@open-mpi.org] on
> behalf of seshendra seshu [seshu...@gmail.com]
> *Sent:* Wednesday, April 25, 2012 10:43 AM
> *To:* Open MPI Users
> *Subject:* Re: [OMPI users] regarding the problem occurred while
> running anmpi programs
>
> Hi
> I have exported the library files as below
>
> [master@ip-10-80-106-70 ~]$ export
> LD_LIBRARY_PATH=/usr/local/openmpi-1.4.5/lib:$LD_LIBRARY_PATH 
>   
> [master@ip-10-80-106-70 ~]$ mpirun --prefix /usr/local/openmpi-1.4.5
> -n 1 --hostfile hostfile out
> out: error while loading shared libraries: libmpi_cxx.so.0: cannot
> open shared object file: No such file or directory
> [master@ip-10-80-106-70 ~]$ mpirun --prefix /usr/local/lib/ -n 1
> --hostfile hostfile
> out   
> 
> out: error while loading shared libraries: libmpi_cxx.so.0: cannot
> open shared object file: No such file or directory
>
> But still iam getting the same error.
>
>
>
>
>
> On Wed, Apr 25, 2012 at 5:36 PM, Jeff Squyres (jsquyres)
> > wrote:
>
> See the FAQ item I cited. 
>
> Sent from my phone. No type good. 
>
> On Apr 25, 2012, at 11:24 AM, "seshendra seshu"
> > wrote:
>
>> Hi
>> now i have created an used and tried to run the program but i got
>> the following error
>>
>> [master@ip-10-80-106-70 ~]$ mpirun -n 1 --hostfile hostfile
>> out  
>>   
>> out: error while loading shared libraries: libmpi_cxx.so.0:
>> cannot open shared object file: No such file or directory
>>
>>
>> thanking you
>>
>>
>>
>> On Wed, Apr 25, 2012 at 5:12 PM, Jeff Squyres > > wrote:
>>
>> On Apr 25, 2012, at 11:06 AM, seshendra seshu wrote:
>>
>> > so should i need to create an user and run the mpi program.
>> or how can i run in cluster
>>
>> It is a "best practice" to not run real applications as root
>> (e.g., MPI applications).  Create a non-privlidged user to
>> run your applications.
>>
>> Then be sure to set your LD_LIBRARY_PATH if you installed
>> Open MPI into a non-system-default location.  See this FAQ item:
>>
>>  
>>  http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com 
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org 
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>>
>>
>> -- 
>>  WITH REGARDS
>> M.L.N.Seshendra
>> ___
>> users mailing list
>> us...@open-mpi.org 
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org 
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
>
> -- 
>  WITH REGARDS
> M.L.N.Seshendra
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] redirecting output

2012-04-02 Thread Prentice Bisbal
On 03/30/2012 11:12 AM, Tim Prince wrote:
>  On 03/30/2012 10:41 AM, tyler.bal...@huskers.unl.edu wrote:
>>
>>
>> I am using the command mpirun -np nprocs -machinefile machines.arch
>> Pcrystal and my output strolls across my terminal I would like to
>> send this output to a file and I cannot figure out how to do soI
>> have tried the general > FILENAME and > log & these generate
>> files however they are empty.any help would be appreciated.

If you see the output on your screen, but it's not being redirected to a
file, it must be printing to STDERR and not STDOUT. The '>' by itself
redirects STDOUT only, so it doesn't redirect error messages. To
redirect STDERR, you can use '2>', which says redirect filehandle # 2,
which is stderr.

some_command 2> myerror.log

or

some_command >myoutput.log 2>myerror.log

 To redirect both STDOUT and STDERR to the same place, use the syntax
"2>&1" to tie STDERR to STDOUT:

some_command > myoutput.log 2>&1

I prefer to see the ouput on the screen at the same time I write it to a
file. That way, if the command hangs for some reason, I know it
immediately. I find the 'tee' command priceless for this:

some_command 2>&1 | tee myoutput.log

Google for 'bash output redirection' and you'll find many helpful pages
with better explanation and examples, like this one:

http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO-3.html

If you don't use bash, those results will be much less helpful.

I hope that helps, or at least gets you pointed in the right direction.

--
Prentice

>
> If you run under screen your terminal output should be collected in
> screenlog.  Beats me why some sysadmins don't see fit to install screen.
>





Re: [OMPI users] Simple question on GRID

2012-03-02 Thread Prentice Bisbal
On 03/01/2012 12:10 AM, Shaandar Nyamtulga wrote:
> Hi
> I have two Beowulf clusters (both Ubuntu 10.10, one is OpenMPI, one is
> MPICH2).
> They run separately in their local network environment.I know there is
> a way to integrate them through Internet, presumably by Grid software,
> I guess. Is there any tutorial to do this?
>  
>

This question is a little off-topic for this list, since this list is
for Open MPI-specific questions (and some general MPI questions). You
should really ask this question on the Beowulf mailing list, which
covers any and all topics related to HPC clustering. See www.beowulf.org
for more information.


Also, you need to be more specific as to what you really want to do:
"integrate" is a vague, overused term. Do you want the scheduler at one
site to be able to manage jobs on the cluster at the other site with no
message-passing traffic between sites? That might be possible.

Or, do you want the two remote clusters to send message-passing traffic
back-and-forth over the internet and behave as a single cluster? That
might be possible, too, but due to the latency and reduced bandwidth of
sending those messages  over the internet,  the performance would be so
poor as to probably not be worth it.

--
Prentice


Re: [OMPI users] ssh between nodes

2012-03-02 Thread Prentice Bisbal

On 02/29/2012 04:51 PM, Martin Siegert wrote:
> Hi,
>
> On Wed, Feb 29, 2012 at 09:09:27PM +, Denver Smith wrote:
>>Hello,
>>On my cluster running moab and torque, I cannot ssh without a password
>>between compute nodes. I can however request multiple node jobs fine. I
>>was wondering if passwordless ssh keys need to be set up between
>>compute nodes in order for mpi applications to run correctly.
>>Thanks
> No. passwordless ssh keys are not needed. In fact, I strong advise
> against using those (teaching users how to generate passwordless
> ssh keys creates security problems: they start using those not just
> for connecting to compute nodes). There are several alternatives:
>
> 1) use openmpi's hooks into torque (use the --with-tm configure option);
> 2) use ssh hostbased authentication (and set IgnoreUserKnownHosts to yes);
> 3) use rsh (works if your cluster is sufficiently small).

What has been said for Torque also holds true for SGE - if you compile
Open MPI with the --with-sge switch, passwordless SSH is not needed,
since Open MPI will work directly with SGE.
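
For example, a build with tight SGE integration enabled might look like this
(sketch; the prefix is just an example):

./configure --prefix=/opt/openmpi --with-sge
make all install

With that, mpirun started inside an SGE job can launch its remote daemons
through SGE itself (qrsh) instead of ssh.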

And as much as I agree passwordless SSH keys are not desirable, they can
be difficult to avoid, especially if you use commercial software on
your cluster. MATLAB, for example, requires passwordless SSH between
cluster nodes in order to work.

--
Prentice.


Re: [OMPI users] [EXTERNAL] Re: Question regarding osu-benchamarks 3.1.1

2012-03-02 Thread Prentice Bisbal

On 02/29/2012 03:15 PM, Jeffrey Squyres wrote:
> On Feb 29, 2012, at 2:57 PM, Jingcha Joba wrote:
>
>> So if I understand correctly, if a message size is smaller than it will use 
>> the MPI way (non-RDMA, 2 way communication), if its larger, then it would 
>> use the Open Fabrics, by using the ibverbs (and ofed stack) instead of using 
>> the MPI's stack?
> Er... no.
>
> So let's talk MPI-over-OpenFabrics-verbs specifically.
>
> All MPI communication calls will use verbs under the covers.  They may use 
> verbs send/receive semantics in some cases, and RDMA semantics in other 
> cases.  "It depends" -- on a lot of things, actually.  It's hard to come up 
> with a good rule of thumb for when it uses one or the other; this is one of 
> the reasons that the openib BTL code is so complex.  :-)
>
> The main points here are:
>
> 1. you can trust the openib BTL to do the Best thing possible to get the 
> message to the other side.  Regardless of whether that message is an MPI_SEND 
> or an MPI_PUT (for example).
>
> 2. MPI_PUT does not necessarily == verbs RDMA write (and likewise, MPI_GET 
> does not necessarily == verbs RDMA read).
>
>> If so, could that be the reason why the MPI_Put "hangs" when sending a 
>> message more than 512KB (or may be 1MB)?
> No.  I'm guessing that there's some kind of bug in the MPI_PUT implementation.
>
>> Also is there a way to know if for a particular MPI call, OF uses send/recv 
>> or RDMA exchange?
> Not really.
>
> More specifically: all things being equal, you don't care which is used.  You 
> just want your message to get to the receiver/target as fast as possible.  
> One of the main ideas of MPI is to hide those kinds of details from the user. 
>  I.e., you call MPI_SEND.  A miracle occurs.  The message is received on the 
> other side.
>
> :-)
>

Nice use of the "A Miracle Occurs" meme. We really need t-shirts that
say this for the OpenMPI BoF at SC12.

--
Prentice


Re: [OMPI users] [Open MPI Announce] Open MPI v1.4.5 released

2012-02-16 Thread Prentice Bisbal

On 02/15/2012 07:44 AM, Reuti wrote:
> Hi,
>
> Am 15.02.2012 um 03:48 schrieb alexalex43210:
>
>>   But I am a novice for the parallel computation, I often use Fortran to 
>> compile my program, now I want to use the Parallel, can you give me some 
>> help how to begin?
>>   PS: I learned about OPEN MPI is the choice for my question solution. am I 
>> right?
> This depends on your application and how easy it can be adopted to split the 
> problem into smaller parts. It could also be the case, that you want to stay 
> on one node only to use additional cores and could parallelize it better by 
> using OpenMP, where all threads operate on the same memory area on a single 
> node.
>
> http://openmp.org/wp/
>
> It's built into many compilers by default nowadays.
>
> In addition to the online courses Jeff mentioned there are several books 
> available like Parallel Programming with MPI by Peter Pacheco (although it 
> covers only MPI-1 due to its age http://www.cs.usfca.edu/~peter/ppmpi/), 
> Parallel Programming in C with MPI and OpenMP by Michael Quinn.
>

Personally, I didn't like Peter Pacheco's book all that much, so I'd
like to add a couple more books to this list of recommendations:

Using MPI: Portable Parallel Programming with the Message-Passing
Interface, 2nd Edition
William Gropp, Ewing Lusk, and Anthony Skjellum
Copyright 1997, MIT Press
http://www.mcs.anl.gov/research/projects/mpi/usingmpi/

Using MPI-2: Advanced Features of the Message-Passing Interface
William Gropp, Ewing Lusk, Rajeev Thakur
Copyright 1997, MIT Press
http://www.mcs.anl.gov/research/projects/mpi/usingmpi2/index.html

The second book covers the more advanced features of MPI-2.  As a n00b
just learning MPI, you probably don't need to learn that stuff until
you've mastered the material in the first book I listed,  or Pacheco's
book. 

--
Prentice






Re: [OMPI users] OpenMPI: How many connections?

2012-01-27 Thread Prentice Bisbal
I would like to nominate the quote below for the best explanation of how
a piece of software works  that I've ever read.

Kudos, Jeff.

On 01/26/2012 04:38 PM, Jeff Squyres wrote:
> You send a message, a miracle occurs, and the message is received on the 
> other side. 

--
Prentice


Re: [OMPI users] ompi + bash + GE + modules

2012-01-13 Thread Prentice Bisbal

On 01/12/2012 08:40 AM, Dave Love wrote:
> Surely this should be on the gridengine list -- and it's in recent
> archives -- but there's some ob-openmpi below.  Can Notre Dame not get
> the support they've paid Univa for?

This is, in fact, in the recent gridengine archives. I brought up this
problem myself within the past couple of months ago.

> Reuti  writes:
>
>> SGE 6.2u5 can't handle multi line environment variables or functions,
>> it was fixed in 6.2u6 which isn't free.
> [It's not listed for 6.2u6.]  For what it's worth, my fix for Sun's fix
> is https://arc.liv.ac.uk/trac/SGE/changeset/3556/sge.
>
>> Do you use -V while submitting the job? Just ignore the error or look
>> into Son of Gridengine which fixed it too.
> Of course
> you can always avoid the issue by not using `export -f', which isn't in
> the modules version we have.  I default -V in sge_request and load
> the open-mpi module in the job submission session.  I don't
> fin whatever problems it causes, and it works for binaries like
>   qsub -b y ... mpirun ...
> However, the folkloristic examples here typically load the module stuff
> in the job script.
>
>> If you can avoid -V, then it could be defined in any of the .profile
>> or alike if you use -l as suggested.  You could even define a
>> started_method in SGE to define it for all users by default and avoid
>> to use -V:
>>
>> #!/bin/sh
>> module() { ...command...here... }
>> export -f module
>> exec "${@}"
> That won't work for example if someone is tasteless enough to submit csh.
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Installation of openmpi-1.4.4

2011-12-21 Thread Prentice Bisbal
Is the path to your openmpi libraries in your LD_LIBRARY_PATH?
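
A quick check (the /opt/openmpi prefix is taken from the configure line
quoted below; adjust to the real install location):

echo $LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH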

--
Prentice


On 12/21/2011 01:56 PM, amosl...@gmail.com wrote:
> Dear OMPI Users,
>   I have just read the messages from Martin Rushton and Jeff
> Squyres and have been having the same problem trying to get
> openmp-1.4.4 to work.  My specs are below:
>Xeon(R) CPU 5335 2.00 GHz
>Linux  SUSE 11.4 (x86_64)
>Linux 2.6.371-1.2 desktop x86_64
> I go through the compilation process with the commands:
>   ./configure --prefix=/opt/openmpi CC=icc
> CXX=icpc F77=ifort F90=ifort "FCFLAGS=-O3 -i8" "FFLAGS=-O3 -i8" 2>&1 |
> tee config.out
>make -j 4 all 2>&1 | tee make.out
>make install 2>&1 | tee install.out.
> The entire process seems to go properly but when I try to use an
> example it doesn't work properly.
>mpicc hello_c.c -o hello_c
> compiles properly.  However,
>"./hello_c" gives an error message that it
> cannot find the file libmpi_so.0.There are at least 3 copies of
> the file present as found by the search command but none of these are
> found.  I have checked the permissions and they seem to be OK so I am
> at the same point as Martin Rushton.  I hope that somebody comes up
> with an anser soon.
>   
>
> Amos Leffler
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] openmpi - gfortran and ifort conflict

2011-12-14 Thread Prentice Bisbal

On 12/14/2011 03:39 PM, Jeff Squyres wrote:
> On Dec 14, 2011, at 3:21 PM, Prentice Bisbal wrote:
>
>> For example, your configure command,
>>
>> ./configure --prefix=/opt/openmpi/intel CC=gcc CXX=g++ F77=ifort FC=ifort
>>
>> Doesn't tell Open MPI to use ifcort for mpif90 and mpif77.
> Actually, that's not correct.
>
> For Open MPI, our wrapper compilers will default to using the same compilers 
> that were used to build Open MPI.  So in the above case:
>
> mpicc will use gcc
> mpicxx will use g++
> mpif77 will use ifort
> mpif90 will use ifort
>
>

Jeff,

I realized this after I wrote that and clarified it in a subsequent
e-mail. Which you probably just read. ;-)

Prentice


Re: [OMPI users] openmpi - gfortran and ifort conflict

2011-12-14 Thread Prentice Bisbal
On 12/14/2011 03:29 PM, Micah Sklut wrote:
> Okay thanks Prentice.
>
> I understand what you are saying about specifying the compilers during
> configure.
> Perhaps, that alone would have solved the problem, but removing the
> 1.4.2 ompi installation worked as well.
>
> Micah
>

Well, to clarify my earlier statement, those compilers used during
installation are used to set the defaults in the wrapper files
(mpif90-wrapper-data.txt, etc.), but those
can easily be changed, either by editing those files, or by defining
environment variables.

Anyhow, we're all glad you were finally able to solve your problem.

--
Prentice




Re: [OMPI users] openmpi - gfortran and ifort conflict

2011-12-14 Thread Prentice Bisbal

On 12/14/2011 01:20 PM, Fernanda Oliveira wrote:
> Hi Micah,
>
> I do not know if it is exactly what you need but I know that there are
> environment variables to use with intel mpi. They are: I_MPI_CC,
> I_MPI_CXX, I_MPI_F77, I_MPI_F90. So, you can set this using 'export'
> for bash, for instance or directly when you run.
>
> I use in my bashrc:
>
> export I_MPI_CC=icc
> export I_MPI_CXX=icpc
> export I_MPI_F77=ifort
> export I_MPI_F90=ifort

Those environment variables are for Intel MPI.  For OpenMPI, the
equivalent variables would be OMPI_CC, OMPI_CXX, OMPI_F77, and OMPI_FC,
respectively.
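
For instance (a quick sketch; hello.f90 is just a placeholder file name):

export OMPI_FC=ifort
mpif90 --showme            # should now report ifort as the underlying compiler
mpif90 hello.f90 -o hello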

--
Prentice


Re: [OMPI users] openmpi - gfortran and ifort conflict

2011-12-14 Thread Prentice Bisbal

On 12/14/2011 12:21 PM, Micah Sklut wrote:
> Hi Gustav,
>
> I did read Price's email:
>
> When I do "which mpif90", i get:
> /opt/openmpi/intel/bin/mpif90
> which is the desired directory/binary
>
> As I mentioned, the config log file indicated it was using ifort, and
> had no mention of gfortran.
> Below is the output from ompi_info. It shows reference to the correct
> ifort compiler. But, yet the mpif90 compiler, still yeilds a gfortran
> compiler.

Micah,

You are confusing the compilers used to build Open MPI itself with the
compilers used by Open MPI to compile other codes with the proper build
environment.

For example, your configure command,

./configure --prefix=/opt/openmpi/intel CC=gcc CXX=g++ F77=ifort FC=ifort

Doesn't tell Open MPI to use ifort for mpif90 and mpif77. It tells the
build process to use ifort to compile the Fortran sections of the Open
MPI source code. To tell mpif90 and mpif77 which compilers you'd like to
use to compile Fortran programs that use Open MPI, you must set the
environment variables OMPI_F77 and OMPI_FC. To illustrate, when I want
to use the gnu compilers, I set the following in my .bashrc:

export OMPI_CC=gcc
export OMPI_CXX=g++
export OMPI_F77=gfortran
export OMPI_FC=gfortran

If I wanted to use the PGI compilers instead, swap the above 4 lines for these:

export OMPI_CC=pgcc
export OMPI_CXX=pgCC
export OMPI_F77=pgf77
export OMPI_FC=pgf95

You can verify which compiler is set using the --showme switch to mpif90:

$ mpif90 --showme
pgf95 -I/usr/local/openmpi-1.2.8/pgi-8.0/x86_64/include
-I/usr/local/openmpi-1.2.8/pgi-8.0/x86_64/lib -L/usr/lib64
-L/usr/local/openmpi-1.2.8/pgi/x86_64/lib
-L/usr/local/openmpi-1.2.8/pgi-8.0/x86_64/lib -lmpi_f90 -lmpi_f77 -lmpi
-lopen-rte -lopen-pal -libverbs -lrt -lnuma -ldl -Wl,--export-dynamic
-lnsl -lutil -lpthread -ldl

I suspect if you run the command ' env | grep OMPI_FC', you'll see that
you have it set to gfortran. I can verify that mine is set to pgf95 this
way:

$ env | grep OMPI_FC
OMPI_FC=pgf95

Of course, a simple echo would work, too:

$ echo $OMPI_FC
pgf95

You can also change these setting by editing the file
mpif90-wrapper-data.txt in your Open MPI installation directory.

Full details on setting these variables (and others) can be found in the
FAQ:

http://www.open-mpi.org/faq/?category=mpi-apps#override-wrappers-after-v1.0

--
Prentice



> -->
> barells@ip-10-17-153-123:~> ompi_info
>  Package: Open MPI barells@ip-10-17-148-204 Distribution
> Open MPI: 1.4.4
>Open MPI SVN revision: r25188
>Open MPI release date: Sep 27, 2011
> Open RTE: 1.4.4
>Open RTE SVN revision: r25188
>Open RTE release date: Sep 27, 2011
> OPAL: 1.4.4
>OPAL SVN revision: r25188
>OPAL release date: Sep 27, 2011
> Ident string: 1.4.4
>   Prefix: /usr/lib64/mpi/gcc/openmpi
>  Configured architecture: x86_64-unknown-linux-gnu
>   Configure host: ip-10-17-148-204
>Configured by: barells
>Configured on: Wed Dec 14 14:22:43 UTC 2011
>   Configure host: ip-10-17-148-204
> Built by: barells
> Built on: Wed Dec 14 14:27:56 UTC 2011
>   Built host: ip-10-17-148-204
>   C bindings: yes
> C++ bindings: yes
>   Fortran77 bindings: yes (all)
>   Fortran90 bindings: yes
>  Fortran90 bindings size: small
>   C compiler: gcc
>  C compiler absolute: /usr/bin/gcc
> C++ compiler: g++
>C++ compiler absolute: /usr/bin/g++
>   Fortran77 compiler: ifort
>   Fortran77 compiler abs: /opt/intel/fce/9.1.040/bin/ifort
>   Fortran90 compiler: ifort
>   Fortran90 compiler abs: /opt/intel/fce/9.1.040/bin/ifort
>  C profiling: yes
>C++ profiling: yes
>  Fortran77 profiling: yes
>  Fortran90 profiling: yes
>   C++ exceptions: no
>   Thread support: posix (mpi: no, progress: no)
>Sparse Groups: no
>   Internal debug support: no
>  MPI parameter check: runtime
> Memory profiling support: no
> Memory debugging support: no
>  libltdl support: yes
>Heterogeneous support: no
>  mpirun default --prefix: no
>  MPI I/O support: yes
>MPI_WTIME support: gettimeofday
> Symbol visibility support: yes
>FT Checkpoint support: no  (checkpoint thread: no)
>MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.4.2)
>   MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.4.2)
>MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.4.2)
>MCA carto: auto_detect (MCA v2.0, API v2.0, Component
> v1.4.2)
>MCA carto: file (MCA v2.0, API v2.0, Component v1.4.2)
>MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.4.2)
>MCA timer: linux (MCA v2.0, API v2.0, Component v1.4.2)
>  MCA installdirs: env (MCA v2.0, API v2.0, 

Re: [OMPI users] wiki and "man mpirun" odds, and a question

2011-11-10 Thread Prentice Bisbal
Paul,

I'm sure this isn't the response you want to hear, but I'll suggest it
anyway:

Queuing systems can forward the submitter's environment if desired. For
example, in SGE, the -V switch forwards all the environment variables to
the job's environment, so if you use a queuing system to launch your jobs,
you might want to check its documentation.
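
For example (sketch; the job script name is a placeholder):

qsub -V myjob.sh                            # SGE: forward the whole submission environment
mpirun -x LD_LIBRARY_PATH -x PATH ./a.out   # Open MPI: one variable per -x option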

--
Prentice 

On 11/10/2011 08:01 AM, Ralph Castain wrote:
> I'm not sure where the FAQ got its information, but it has always been one 
> param per -x option.
>
> I'm afraid there isn't any envar to support the setting of multiple -x 
> options. We didn't expect someone to forward very many, if any, so we didn't 
> create that capability. It wouldn't be too hard to convert it to an mca 
> param, though, so you could add such options to your mca param file, if that 
> would help.
>
>
> On Nov 10, 2011, at 4:02 AM, Paul Kapinos wrote:
>
>> Hi folks,
>> I.  looked for ways to tell to "mpiexec" to forward some environment 
>> variables, I saw a mismatch:
>>
>> ---
>> http://www.open-mpi.org/faq/?category=running#mpirun-options
>> ...
>> --x : A comma-delimited list of environment variables to 
>> export to the parallel application.
>> ---
>> (Open MPI/1.5.3)
>> $ man mpirun
>>   -x 
>>  Export  the  specified environment variables to the remote 
>> nodes before executing the program.  Only one environment variable can
>>^^^
>> be  specified per -x option.
>> ---
>>
>> So, either the info is outdated somewhre, or -x and --x have different 
>> meaning - but then there is a lack of info, too :o)
>>
>> Maybe you could update the Wiki and/or the man page?
>>
>> II. Now the question. Defaultly no non-OpenMPI environmet variables are 
>> exported to the parallel application, AFAIK.
>>
>> With -x option of mpiexec it is possible to export one (or a list of, see 
>> below) environment variable. But, it's a bit tedious to type a [long] list 
>> of variables.
>>
>> Is there someone envvar, by setting which to a list of names of other 
>> envvars the same effect could be achieved as by setting -x on command line 
>> of mpirun?
>>
>> Best wishes
>> Paul Kapinos
>>
>>
>> -- 
>> Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
>> RWTH Aachen University, Center for Computing and Communication
>> Seffenter Weg 23,  D 52074  Aachen (Germany)
>> Tel: +49 241/80-24915
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Role of ethernet interfaces of startup of openmpi job using IB

2011-09-28 Thread Prentice Bisbal
On 09/27/2011 05:30 PM, Jeff Squyres wrote:
> On Sep 27, 2011, at 5:03 PM, Prentice Bisbal wrote:
> 
>> To clarify, is IP/Ethernet required, or will IPoIB be used if it's
>> configured on the nodes? Would this make a difference.
> 
> IPoIB is fine, although I've heard concerns about its stability at scale.
> 
> The difference that it'll make is that it's generally faster than ethernet.  
> It never runs at wire IB speed because of the overheads involved, but it's 
> likely to be much faster than 1GB ethernet, for example.
> 
> You can specify which interfaces Open MPI's OOB channel uses with the 
> oob_tcp_if_include MCA parameter.  For example:
> 
>mpirun --mca oob_tcp_if_include ib0 ...
> 

Jeff,

Thanks for the clarification. I was just checking. Earlier in this
thread you specifically said "ethernet". I suspected you meant "IP", and
just wanted to be sure.


Re: [OMPI users] Role of ethernet interfaces of startup of openmpi job using IB

2011-09-27 Thread Prentice Bisbal

On 09/27/2011 07:50 AM, Jeff Squyres wrote:
> On Sep 27, 2011, at 6:35 AM, Salvatore Podda wrote:
> 
>>  We would like to know if the ethernet interfaces play any role in the 
>> startup phase of an opempi job using InfiniBand
>> In this case, where we can found some literature on this topic?
> 
> Unfortunately, there's not a lot of docs about this other than people asking 
> questions on this list.
> 
> IP is used by default during Open MPI startup.  Specifically, it is used as 
> our "out of band" communication channel for things like stdin/stdout/stderr 
> redirection, launch command relaying, process control, etc.  The OOB channel 
> is also used by default for bootstrapping IB queue pairs.

To clarify, is IP/Ethernet required, or will IPoIB be used if it's
configured on the nodes? Would this make a difference?

Just curious,
Prentice


Re: [OMPI users] Anyone with Visual Studio + MPI Experience

2011-07-07 Thread Prentice Bisbal
Miguel,

Thanks for the assistance. I don't have the MPI options you spoke of, so
I figured that might have been part of the HPC Pack. I found a couple of
web pages that helped me make progress. I'm not 100% there, but I'm much
closer, say 85% of the way there.

Now I can get a Fortran+MPI program to run with a single click, but
then I get an error that's OpenMPI-related. The same program runs from
the command-line, so I think it's just a matter of me making sure some
environment variables are set correctly. It turns out the user I'm doing
this for will be away for 6 weeks, so this is no longer the priority it
was a few days ago.

Prentice


On 07/07/2011 01:47 PM, Miguel Vargas Felix wrote:
> Prentice,
> 
> I didn't have to install the HPC Pack, as far as I know it is only needed
> when you want to develop/debug in a cluster. I'm sorry I can't help you
> with VS 2010 (I hated it, I switched back to VS 2008), but the
> instructions to configure VS 2010 seem to be similar; check the MPICH2
> guide for Windows developers.
> 
> http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-1.3.2-windevguide.pdf
> 
> May be this option is not available for Visual Fortran.
> 
> -Miguel
> 
>> Miguel,
>>
>> I'm using VS 2010 Professional + Intel Visual Fortran. I don't have the
>> "Debugger to Launch" option in my version (or I'm looking in the wrong
>> place), and don't see MPI options any where. Do you have any additional
>> software installed, like the HPC Pack 2008?
>>
>> Prentice
>>
>> On 07/04/2011 04:32 PM, Miguel Vargas Felix wrote:
>>>
>>> Hi,
>>>
>>> well, I don't have a lot of experience with VS+MPI, but these are the
>>> steps that I followed to make my projects run:
>>>
>>> 1. Select your project from the Solution explorer, right-click and
>>> select
>>> "Properties"
>>>
>>> 2. From the list on the left, select "Debugging"
>>>
>>> 3. Set "Debugger to launch" to "MPI Cluster Debugger"
>>>
>>> 4. Set "MPIRun Command" to the full path of your "mpiexec" (use quotes
>>> to enclose the path)
>>>
>>> 5. Use "MPIRun Arguments" to set the number of processes to start, like
>>> "-n 4"
>>>
>>> 6. Set "MPIRUN Working Directory" if you need.
>>>
>>> 7. "Application Command" normally is "$(TargetPath)"
>>>
>>> 8. "Application Arguments" if you need them.
>>>
>>> 9. "MPIShim Location", this is a tricky one; for some reason sometimes VS
>>> needs the full path for this VS tool. It is located at: "C:\Program
>>> Files\Microsoft Visual Studio 9.0\Common7\IDE\Remote
>>> Debugger\x64\mpishim.exe" or "C:\Program Files\Microsoft Visual Studio
>>> 9.0\Common7\IDE\Remote Debugger\x86\mpishim.exe" (use quotes to enclose
>>> the path).
>>>
>>> I haven't played with the other options.
>>>
>>> 10. Close the dialog box.
>>>
>>> 11. Set some breakpoints in your program.
>>>
>>> 12. Ready to run.
>>>
>>> These instructions only work to debug MPI processes on the localhost, and
>>> I only have tested VS+MPI using MPICH2 for Windows.
>>>
>>> To debug on several nodes you should install the Microsoft HPC SDK (I
>>> haven't used it).
>>>
>>> Good luck.
>>>
>>> -Miguel
>>>
>>> PS. I use Visual Studio 2008 professional. Also, I know that MPI
>>> debugging
>>> is not available in VS Express editions.
>>>
>>>
 Does anyone on this list have experience using MS Visual Studio for MPI
 development? I'm supporting a Windows user who has been doing Fortran
 programming on Windows using an ANCIENT version of Digital Visual
 Fortran (I know, I know - using "ancient" and "Digital" in the same
 sentence is redundant.)

 Well, we are upgrading his equally ancient laptop to a new one with Windows
 7, so we installed Intel Visual Fortran (direct descendent of DVF) and
 Visual Studio 2010, and to be honest, I feel like a fish out of water
 using VS 2010. It took me longer than I care to admit to figure out
 how to specify the include and linker paths.

 Right now, I'm working with the Intel MPI libraries, but plan on
 installing OpenMPI, too, once I figure out VS 2010.

 Can anyone tell me how to configure visual studio so that when you
 click
 on the little "play" icon to build/run the code, it will call mpiexec
 automatically? Right now, it compiles fine, but throws errors when the
 program executes because it doesn't have the right environment setup
 because it's not being executed by mpiexec. It runs fine when I execute
 it with mpiexec or wmpiexec.

 --
 Prentice


>>>
>>>


Re: [OMPI users] Anyone with Visual Studio + MPI Experience

2011-07-06 Thread Prentice Bisbal
Miguel,

I'm using VS 2010 Professional + Intel Visual Fortran. I don't have the
"Debugger to Launch" option in my version (or I'm looking in the wrong
place), and don't see MPI options any where. Do you have any additional
software installed, like the HPC Pack 2008?

Prentice

On 07/04/2011 04:32 PM, Miguel Vargas Felix wrote:
> 
> Hi,
> 
> well, I don't have a lot of experience with VS+MPI, but these are the
> steps that I followed to make my projects run:
> 
> 1. Select your project from the Solution explorer, right-click and select
> "Properties"
> 
> 2. From the list on the left, select "Debugging"
> 
> 3. Set "Debugger to launch" to "MPI Cluster Debugger"
> 
> 4. Set "MPIRun Command" to the full path of your "mpiexec" (use quotes
> to enclose the path)
> 
> 5. Use "MPIRun Arguments" to set the number of processes to start, like
> "-n 4"
> 
> 6. Set "MPIRUN Working Directory" if you need.
> 
> 7. "Application Command" normally is "$(TargetPath)"
> 
> 8. "Application Arguments" if you need them.
> 
> 9. "MPIShim Location", this is a tricky one; for some reason sometimes VS
> needs the full path for this VS tool. It is located at: "C:\Program
> Files\Microsoft Visual Studio 9.0\Common7\IDE\Remote
> Debugger\x64\mpishim.exe" or "C:\Program Files\Microsoft Visual Studio
> 9.0\Common7\IDE\Remote Debugger\x86\mpishim.exe" (use quotes to enclose
> the path).
> 
> I haven't played with the other options.
> 
> 10. Close the dialog box.
> 
> 11. Set some breakpoints in your program.
> 
> 12. Ready to run.
> 
> These instructions only work to debug MPI processes on the localhost, and
> I only have tested VS+MPI using MPICH2 for Windows.
> 
> To debug on several nodes you should install the Microsoft HPC SDK (I
> haven't used it).
> 
> Good luck.
> 
> -Miguel
> 
> PS. I use Visual Studio 2008 professional. Also, I know that MPI debugging
> is not available in VS Express editions.
> 
> 
>> Does anyone on this list have experience using MS Visual Studio for MPI
>> development? I'm supporting a Windows user who has been doing Fortran
>> programming on Windows using an ANCIENT version of Digital Visual
>> Fortran (I know, I know - using "ancient" and "Digital" in the same
>> sentence is redundant.)
>>
>> Well, we are upgrading his equally ancient laptop to a new one with Windows
>> 7, so we installed Intel Visual Fortran (direct descendent of DVF) and
>> Visual Studio 2010, and to be honest, I feel like a fish out of water
>> using VS 2010. It took me longer than I care to admit to figure out
>> how to specify the include and linker paths.
>>
>> Right now, I'm working with the Intel MPI libraries, but plan on
>> installing OpenMPI, too, once I figure out VS 2010.
>>
>> Can anyone tell me how to configure visual studio so that when you click
>> on the little "play" icon to build/run the code, it will call mpiexec
>> automatically? Right now, it compiles fine, but throws errors when the
>> program executes because it doesn't have the right environment setup
>> because it's not being executed by mpiexec. It runs fine when I execute
>> it with mpiexec or wmpiexec.
>>
>> --
>> Prentice
>>
>>
> 
> 


Re: [OMPI users] mpi & mac

2011-07-06 Thread Prentice Bisbal
On 07/06/2011 10:42 AM, Constantinos Makassikis wrote:
> On Tue, Jul 5, 2011 at 9:48 PM, Robert Sacker wrote:
> 
> Hi all,
> 
> Hello !
> 
> I need some help. I'm trying to run C++ code in Xcode on a Mac Pro
> Desktop (OS 10.6) and utilize all 8 cores . My ultimate goal is to
> be able to run the code on the cluster here on campus. I'm in the
> process of converting into C++ the number crunching part of the
> stuff I previously wrote in Matlab. 
> Is there some documentation that explains how to get started?
> Thanks. Bob
> 
> 
> I am not sure whether this is the relevant mailing list for
> general parallelization questions ...

Well, general MPI questions not specific to OpenMPI are not uncommon here.

> 
> In any case, before converting your Matlab code to C++ try using
> parallelization features that come with Matlab.
> 
> Otherwise, after translating your Matlab code to C++, you should
> consider in the first place getting acquainted with OpenMP and
> use it to speed up your code on your 8-core machine.
> OpenMP can be rather straightforward to apply.
> 
> Afterwards, if necessary, you may look into parallelizing over multiple
> machines with OpenMPI.

Why not just use MPI for every step? Open MPI can detect when
communication partners are on the same host and use shared memory for
improved performance. Not sure how this measures up to OpenMP for
intra-node communications, but I imagine it can make the programming
simpler, since only one syntax needs to be learned/used.

As I said, I don't know the performance difference between MPI and
OpenMP, so if someone can shed some light...
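
(Purely as a sketch: on a single 8-core machine nothing special is needed;
something like "mpirun -np 8 ./mycode" starts 8 ranks locally and Open MPI
uses its shared-memory transport between them automatically. "mycode" is
just a placeholder name.)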





Re: [OMPI users] Anyone with Visual Studio + MPI Experience

2011-06-30 Thread Prentice Bisbal
Thanks, Joe.

I did say that, but I meant that in a different way. For program 'foo',
I need to tell Visual Studio that when I click on the 'run' button, I
need it to execute

mpiexec -np X foo

instead of just

foo

I know what I *need* to do to the VS environment, I just don't know
*how* to do it. I've been going through all the settings, but can't find
the magical checkbox or textbox.

Windows is so disorienting. It's like someone went out of their way to
make life as hard as possible for us command-line guys.

Prentice

On 06/30/2011 04:46 PM, Joe Griffin wrote:
> Prentice,
> 
> It might or might not matter, but on your older system you
> may have used "LD_LIBRARY_PATH", but on Windows you need "PATH"
> to contain the library path.
> 
> I only mention this because you said it runs in one environment,
> but not the other.
> 
> Joe
> 
> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Prentice Bisbal
> Sent: Thursday, June 30, 2011 1:42 PM
> To: Open MPI Users
> Subject: [OMPI users] Anyone with Visual Studio + MPI Experience
> 
> Does anyone on this list have experience using MS Visual Studio for MPI
> development? I'm supporting a Windows user who has been doing Fortran
> programming on Windows using an ANCIENT version of Digital Visual
> Fortran (I know, I know - using "ancient" and "Digital" in the same
> sentence is redundant.)
> 
> Well, we are upgrading his equally ancient laptop to a new one with Windows
> 7, so we installed Intel Visual Fortran (direct descendent of DVF) and
> Visual Studio 2010, and to be honest, I feel like a fish out of water
> using VS 2010. It took me longer than I care to admit to figure out
> how to specify the include and linker paths.
> 
> Right now, I'm working with the Intel MPI libraries, but plan on
> installing OpenMPI, too, once I figure out VS 2010.
> 
> Can anyone tell me how to configure visual studio so that when you click
> on the little "play" icon to build/run the code, it will call mpiexec
> automatically? Right now, it compiles fine, but throws errors when the
> program executes because it doesn't have the right environment setup
> because it's not being executed by mpiexec. It runs fine when I execute
> it with mpiexec or wmpiexec.
> 


[OMPI users] Anyone with Visual Studio + MPI Experience

2011-06-30 Thread Prentice Bisbal
Does anyone on this list have experience using MS Visual Studio for MPI
development? I'm supporting a Windows user who has been doing Fortran
programming on Windows using an ANCIENT version of Digital Visual
Fortran (I know, I know - using "ancient" and "Digital" in the same
sentence is redundant.)

Well, we are upgrading his equally ancient laptop to a new one with Windows
7, so we installed Intel Visual Fortran (direct descendent of DVF) and
Visual Studio 2010, and to be honest, I feel like a fish out of water
using VS 2010. It took me longer than I care to admit to figure out
how to specify the include and linker paths.

Right now, I'm working with the Intel MPI libraries, but plan on
installing OpenMPI, too, once I figure out VS 2010.

Can anyone tell me how to configure visual studio so that when you click
on the little "play" icon to build/run the code, it will call mpiexec
automatically? Right now, it compiles fine, but throws errors when the
program executes because it doesn't have the right environment setup
because it's not being executed by mpiexec. It runs fine when I execute
it with mpiexec or wmpiexec.

-- 
Prentice


Re: [OMPI users] SGE and openmpi

2011-04-07 Thread Prentice Bisbal


On 04/06/2011 07:09 PM, Jason Palmer wrote:
> Hi,
> I am having trouble running a batch job in SGE using openmpi.  I have read
> the faq, which says that openmpi will automatically do the right thing, but
> something seems to be wrong.
> 
> Previously I used MPICH1 under SGE without any problems. I'm avoiding MPICH2
> because it doesn't seem to support static compilation, whereas I was able to
> get openmpi to compile with open64 and compile my program statically.
> 
> But I am having problems launching. According to the documentation, I should
> be able to have a script file, qsub.sh:
> 
> #!/bin/bash
> #$ -cwd
> #$ -j y
> #$ -S /bin/bash
> #$ -q all.q
> #$ -pe orte 18
> MPI_DIR=/home/jason/openmpi-1.4.3-install/bin
> /home/jason/openmpi-1.4.3-install/bin/mpirun -np $NSLOTS  myprog
> 

If you have SGE integration, you should not specify the number of slots
requested on the command-line. Open MPI will speak directly to SGE (or
vice versa) to get this information.

Also, what is the significance of specifying MPI_DIR? I think you want to
add that to your PATH, and then export it to the rest of the nodes by
using the -V switch to qsub. If the correct mpirun isn't found first in
your PATH, your job will definitely fail when launched on the slave hosts.

You should also add the path to the MPI libraries to your
LD_LIBRARY_PATH, or else you'll end up with run-time linking problems.

For example, I would change your submission script to look like this:

#!/bin/bash
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -q all.q
#$ -pe orte 18
#$ -V

MPI_DIR=/home/jason/openmpi-1.4.3-install
export PATH=$MPI_DIR/bin:$PATH
export LD_LIBRARY_PATH=$MPI_DIR/lib:$LD_LIBRARY_PATH

mpirun myprog

This may not fix all your problems, but will definitely fix some of them.


-- 
Prentice


Re: [OMPI users] printf and scanf problem of C code compiled with Open MPI

2011-03-29 Thread Prentice Bisbal
On 03/29/2011 01:29 PM, Meilin Bai wrote:
> Dear open-mpi users:
>  
> I came across a little problem when running an MPI C program compiled
> with Open MPI 1.4.3. Part of the code is as follows:
>  
> MPI_Init(&argc, &argv);
> MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
> MPI_Comm_rank(MPI_COMM_WORLD, &myid);
> MPI_Get_processor_name(processor_name, &namelen);
> if (myid == 0) {
>  printf("Please give N= ");
>  //fflush(stdout);
>  scanf("%d", &n);
>  startwtime = MPI_Wtime();
>  }
>  
> If I comment out the "fflush(stdout);" line, it doesn't print out
> the message till I input an integer n. And if I add the fflush call
> between them, it works as expected, though it obviously consumes some time.
>  
> However, when I compiled it with MPICH2-1.3.2p1, without the fflush call
> in the code, it works correctly.
>  
> Does anyone know what the matter is?
>  

The Open MPI Developers (Jeff, Ralph, etc) can confirm this:

The MPI standard doesn't have a lot of strict requirements for I/O
behavior like this, so implementations are allowed to buffer I/O if they
want. There is nothing wrong with requiring fflush(stdout) in order to
get the behavior you want. In fact, if you check some text books on MPI
programming, I'm pretty sure they recommend using fflush to minimize
this problem.
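
For what it's worth, here's a minimal, self-contained sketch of that pattern
(untested; the MPI_Bcast is only there to hand N to the other ranks, and the
variable names are illustrative):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int myid, n = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 0) {
        printf("Please give N= ");
        fflush(stdout);   /* push the prompt out before scanf() blocks */
        scanf("%d", &n);
    }
    /* share the value read on rank 0 with the other ranks */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}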

MPICH behaves differently because its developers made different design
choices.

Neither behavior is "wrong".

-- 
Prentice


Re: [OMPI users] OpenMPI 1.2.x segfault as regular user

2011-03-21 Thread Prentice Bisbal
On 03/20/2011 06:22 PM, kevin.buck...@ecs.vuw.ac.nz wrote:
> 
>> It's not hard to test whether or not SELinux is the problem. You can
>> turn SELinux off on the command-line with this command:
>>
>> setenforce 0
>>
>> Of course, you need to be root in order to do this.
>>
>> After turning SELinux off, you can try reproducing the error. If it
>> still occurs, the problem is elsewhere; if it doesn't, it was SELinux. When
>> you're done, you can re-enable SELinux with
>>
>> setenforce 1
>>
>> If you're running your job across multiple nodes, you should disable
>> SELinux on all of them for testing.
> 
> You are not actually disabling SELinux with setenforce 0, just
> putting it into "permissive" mode: SELinux is still active.
> 

That's correct. Thanks for catching my inaccurate choice of words.

> Running SELinux in its permissive mode, as opposed to disabling it
> at boot time, sees SELinux making a log of things that would cause
> it to dive in, were it running in "enforcing" mode.

I forgot about that. Checking those logs will make debugging even easier
for the original poster.

> 
> There's then a tool you can run over that log that will suggest
> the ACL changes you need to make to fix the issue from an SELinux
> perspective.
> 
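
(That tool is presumably audit2allow; a typical, purely illustrative,
invocation looks something like

   grep avc /var/log/audit/audit.log | audit2allow -M mypolicy
   semodule -i mypolicy.pp

where "mypolicy" is just a placeholder module name.)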

-- 
Prentice


Re: [OMPI users] What's wrong with this code?

2011-02-23 Thread Prentice Bisbal
Jeff Squyres wrote:
> On Feb 23, 2011, at 3:36 PM, Prentice Bisbal wrote:
> 
>> It's using MPI_STATUS_SIZE to dimension istatus before mpif.h is even
>> read! Correcting the order of the include and declaration statements
>> fixed the problem. D'oh!
> 
> A pox on old fortran for letting you use symbols before they are declared...
> 

I second that emotion.

The error message could have been a tad more helpful.

-- 
Prentice


Re: [OMPI users] What's wrong with this code?

2011-02-23 Thread Prentice Bisbal
Jeff Squyres wrote:
> On Feb 23, 2011, at 2:20 PM, Prentice Bisbal wrote:
> 
>> I suspected that and checked for it earlier. I just double-checked, and
>> that is not the problem. Out of the two source files, 'include mpif.h'
>> appears once, and 'use mpi' does not appear at all. I'm beginning to
>> suspect it is the compiler that is the problem. I'm using ifort 11.1.
>> It's not the latest version, but it's only about 1 year old.
> 
> 11.1 should be fine - I test with that regularly.
> 
> Can you put together a small example that shows the problem and isn't 
> proprietary?
> 

Jeff,

Thanks for requesting that. As I was looking at the original code to
write a small test program, I found the source of the error. Doesn't it
always work that way?

The code I'm debugging looked like this:

c main program
implicit integer(i-m)
integer ierr,istatus(MPI_STATUS_SIZE)
include 'mpif.h'
call MPI_Init(ierr)
call MPI_Comm_rank(MPI_COMM_WORLD,imy_rank,ierr)
call MPI_Comm_size(MPI_COMM_WORLD,iprocess,ierr)
call MPI_FINALIZE(ierr)
stop
end


Can you see the error?  Scroll down for answer ;)









It's using MPI_STATUS_SIZE to dimension istatus before mpif.h is even
read! Correcting the order of the include and declaration statements
fixed the problem. D'oh!


-- 
Prentice


Re: [OMPI users] What's wrong with this code?

2011-02-23 Thread Prentice Bisbal


Jeff Squyres wrote:
> I thought the error was this:
> 
> $ mpif90 -o simplex simplexmain579m.for simplexsubs579
> /usr/local/openmpi-1.2.8/intel-11/x86_64/include/mpif-config.h(88):
> error #6406: Conflicting attributes or multiple declaration of name.
> [MPI_STATUS_SIZE]
>  parameter (MPI_STATUS_SIZE=5)
> -^
> simplexmain579m.for(147): error #6591: An automatic object is invalid in
> a main program.   [ISTATUS]
>integer ierr,istatus(MPI_STATUS_SIZE)
> -^
> 
> which seems to only show the definition in mpif-config.h (which is an 
> internal OMPI file).  I could be mis-interpreting those compiler messages, 
> though...
> 
> Off-the-wall guess here: is the program doing both "use mpi" *and* "include 
> mpif.h" in the same subroutine...?

Jeff,

I suspected that and checked for it earlier. I just double-checked, and
that is not the problem. Out of the two source files, 'include mpif.h'
appears once, and 'use mpi' does not appear at all. I'm beginning to
suspect it is the compiler that is the problem. I'm using ifort 11.1.
It's not the latest version, but it's only about 1 year old.

$ ifort --version
ifort (IFORT) 11.1 20100203
Copyright (C) 1985-2010 Intel Corporation.  All rights reserved.

--
Prentice



> 
> 
> On Feb 23, 2011, at 11:51 AM, Tim Prince wrote:
> 
>> On 2/23/2011 8:27 AM, Prentice Bisbal wrote:
>>> Jeff Squyres wrote:
>>>> On Feb 23, 2011, at 9:48 AM, Tim Prince wrote:
>>>>
>>>>>> I agree with your logic, but the problem is where the code containing
>>>>>> the error is coming from - it's coming from a header file that's a
>>>>>> part of Open MPI, which makes me think this is a compiler error, since
>>>>>> I'm sure there are plenty of people using the same header file in their
>>>>>> code.
>>>>>>
>>>>> Are you certain that they all find it necessary to re-define identifiers 
>>>>> from that header file, rather than picking parameter names which don't 
>>>>> conflict?
>>>> Without seeing the code, it sounds like Tim might be right: someone is 
>>>> trying to re-define the MPI_STATUS_SIZE parameter that is being defined by 
>>>> OMPI's mpif-config.h header file.  Regardless of include 
>>>> file/initialization ordering (i.e., regardless of whether mpif-config.h is 
>>>> the first or Nth entity to try to set this parameter), user code should 
>>>> never set this parameter value.
>>>>
>>>> Or any symbol that begins with MPI_, for that matter.  The entire "MPI_" 
>>>> namespace is reserved for MPI.
>>>>
>>> I understand that, and I checked the code to make sure the programmer
>>> didn't do anything stupid like that.
>>>
>>> The entire code is only a few hundred lines in two different files. In
>>> the entire program, there is only 1 include statement:
>>>
>>> include 'mpif.h'
>>>
>>> and MPI_STATUS_SIZE appears only once:
>>>
>>> integer ierr,istatus(MPI_STATUS_SIZE)
>>>
>>> I have limited knowledge of Fortran programming, but based on this, I
>>> don't see how MPI_STATUS_SIZE could be getting overwritten.
>>>
>>>
>> Earlier, you showed a preceding PARAMETER declaration setting a new value 
>> for that name, which would be required to make use of it in this context.  
>> Apparently, you intend to support only compilers which violate the Fortran 
>> standard by supporting a separate name space for PARAMETER identifiers, so 
>> that you can violate the MPI standard by using MPI_ identifiers in a manner 
>> which I believe is called shadowing in C.
>>
>> -- 
>> Tim Prince
> 
> 



Re: [OMPI users] What's wrong with this code?

2011-02-23 Thread Prentice Bisbal


Tim Prince wrote:
> On 2/23/2011 8:27 AM, Prentice Bisbal wrote:
>> Jeff Squyres wrote:
>>> On Feb 23, 2011, at 9:48 AM, Tim Prince wrote:
>>>
>>>>> I agree with your logic, but the problem is where the code containing
>>>>> the error is coming from - it's coming from a header file that's a
>>>>> part of Open MPI, which makes me think this is a compiler error, since
>>>>> I'm sure there are plenty of people using the same header file in
>>>>> their
>>>>> code.
>>>>>
>>>> Are you certain that they all find it necessary to re-define
>>>> identifiers from that header file, rather than picking parameter
>>>> names which don't conflict?
>>>
>>> Without seeing the code, it sounds like Tim might be right: someone
>>> is trying to re-define the MPI_STATUS_SIZE parameter that is being
>>> defined by OMPI's mpif-config.h header file.  Regardless of include
>>> file/initialization ordering (i.e., regardless of whether
>>> mpif-config.h is the first or Nth entity to try to set this
>>> parameter), user code should never set this parameter value.
>>>
>>> Or any symbol that begins with MPI_, for that matter.  The entire
>>> "MPI_" namespace is reserved for MPI.
>>>
>>
>> I understand that, and I checked the code to make sure the programmer
>> didn't do anything stupid like that.
>>
>> The entire code is only a few hundred lines in two different files. In
>> the entire program, there is only 1 include statement:
>>
>> include 'mpif.h'
>>
>> and MPI_STATUS_SIZE appears only once:
>>
>> integer ierr,istatus(MPI_STATUS_SIZE)
>>
>> I have limited knowledge of Fortran programming, but based on this, I
>> don't see how MPI_STATUS_SIZE could be getting overwritten.
>>
>>
> Earlier, you showed a preceding PARAMETER declaration setting a new
> value for that name, which would be required to make use of it in this
> context.  Apparently, you intend to support only compilers which violate
> the Fortran standard by supporting a separate name space for PARAMETER
> identifiers, so that you can violate the MPI standard by using MPI_
> identifiers in a manner which I believe is called shadowing in C.
> 

Tim,

Check the original post again - that PARAMETER line you are referring to
 comes from the mpif-config.h file - not from my own code.

-- 
Prentice


Re: [OMPI users] What's wrong with this code?

2011-02-23 Thread Prentice Bisbal
Jeff Squyres wrote:
> On Feb 23, 2011, at 9:48 AM, Tim Prince wrote:
> 
>>> I agree with your logic, but the problem is where the code containing
>>> the error is coming from - it's coming from a header file that's a
>>> part of Open MPI, which makes me think this is a compiler error, since
>>> I'm sure there are plenty of people using the same header file in their
>>> code.
>>>
>> Are you certain that they all find it necessary to re-define identifiers 
>> from that header file, rather than picking parameter names which don't 
>> conflict?
> 
> Without seeing the code, it sounds like Tim might be right: someone is trying 
> to re-define the MPI_STATUS_SIZE parameter that is being defined by OMPI's 
> mpif-config.h header file.  Regardless of include file/initialization 
> ordering (i.e., regardless of whether mpif-config.h is the first or Nth 
> entity to try to set this parameter), user code should never set this 
> parameter value.  
> 
> Or any symbol that begins with MPI_, for that matter.  The entire "MPI_" 
> namespace is reserved for MPI.
> 

I understand that, and I checked the code to make sure the programmer
didn't do anything stupid like that.

The entire code is only a few hundred lines in two different files. In
the entire program, there is only 1 include statement:

include 'mpif.h'

and MPI_STATUS_SIZE appears only once:

integer ierr,istatus(MPI_STATUS_SIZE)

I have limited knowledge of Fortran programming, but based on this, I
don't see how MPI_STATUS_SIZE could be getting overwritten.


-- 
Prentice


Re: [OMPI users] What's wrong with this code?

2011-02-23 Thread Prentice Bisbal


Tim Prince wrote:
> On 2/22/2011 1:41 PM, Prentice Bisbal wrote:
>> One of the researchers I support is writing some Fortran code that uses
>> Open MPI. The code is being compiled with the Intel Fortran compiler.
>> This one line of code:
>>
>> integer ierr,istatus(MPI_STATUS_SIZE)
>>
>> leads to these errors:
>>
>> $ mpif90 -o simplex simplexmain579m.for simplexsubs579
>> /usr/local/openmpi-1.2.8/intel-11/x86_64/include/mpif-config.h(88):
>> error #6406: Conflicting attributes or multiple declaration of name.
>> [MPI_STATUS_SIZE]
>>parameter (MPI_STATUS_SIZE=5)
>> -^
>> simplexmain579m.for(147): error #6591: An automatic object is invalid in
>> a main program.   [ISTATUS]
>>  integer ierr,istatus(MPI_STATUS_SIZE)
>> -^
>> simplexmain579m.for(147): error #6219: A specification expression object
>> must be a dummy argument, a COMMON block object, or an object accessible
>> through host or use association   [MPI_STATUS_SIZE]
>>  integer ierr,istatus(MPI_STATUS_SIZE)
>> -^
>> /usr/local/openmpi-1.2.8/intel-11/x86_64/include/mpif-common.h(211):
>> error #6756: A COMMON block data object must not be an automatic object.
>>[MPI_STATUS_IGNORE]
>>integer MPI_STATUS_IGNORE(MPI_STATUS_SIZE)
>> --^
>> /usr/local/openmpi-1.2.8/intel-11/x86_64/include/mpif-common.h(211):
>> error #6591: An automatic object is invalid in a main program.
>> [MPI_STATUS_IGNORE]
>>integer MPI_STATUS_IGNORE(MPI_STATUS_SIZE)
>>
>>
>> Any idea how to fix this? Is this a bug in the Intel compiler, or the
>> code?
>>
> 
> I can't see the code from here.  The first failure to recognize the
> PARAMETER definition apparently gives rise to the others.  According to
> the message, you already used the name MPI_STATUS_SIZE in mpif-config.h
> and now you are trying to give it another usage (not case sensitive) in
> the same scope.  If so, it seems good that the compiler catches it.

I agree with your logic, but the problem is where the code containing
the error is coming from - it's coming from a header file that's a
part of Open MPI, which makes me think this is a compiler error, since
I'm sure there are plenty of people using the same header file in their
code.


-- 
Prentice


[OMPI users] What's wrong with this code?

2011-02-22 Thread Prentice Bisbal
One of the researchers I support is writing some Fortran code that uses
Open MPI. The code is being compiled with the Intel Fortran compiler.
This one line of code:

integer ierr,istatus(MPI_STATUS_SIZE)

leads to these errors:

$ mpif90 -o simplex simplexmain579m.for simplexsubs579
/usr/local/openmpi-1.2.8/intel-11/x86_64/include/mpif-config.h(88):
error #6406: Conflicting attributes or multiple declaration of name.
[MPI_STATUS_SIZE]
  parameter (MPI_STATUS_SIZE=5)
-^
simplexmain579m.for(147): error #6591: An automatic object is invalid in
a main program.   [ISTATUS]
integer ierr,istatus(MPI_STATUS_SIZE)
-^
simplexmain579m.for(147): error #6219: A specification expression object
must be a dummy argument, a COMMON block object, or an object accessible
through host or use association   [MPI_STATUS_SIZE]
integer ierr,istatus(MPI_STATUS_SIZE)
-^
/usr/local/openmpi-1.2.8/intel-11/x86_64/include/mpif-common.h(211):
error #6756: A COMMON block data object must not be an automatic object.
  [MPI_STATUS_IGNORE]
  integer MPI_STATUS_IGNORE(MPI_STATUS_SIZE)
--^
/usr/local/openmpi-1.2.8/intel-11/x86_64/include/mpif-common.h(211):
error #6591: An automatic object is invalid in a main program.
[MPI_STATUS_IGNORE]
  integer MPI_STATUS_IGNORE(MPI_STATUS_SIZE)


Any idea how to fix this? Is this a bug in the Intel compiler, or the code?

Some additional information:

$ mpif90 --showme
ifort -I/usr/local/openmpi-1.2.8/intel-11/x86_64/include
-I/usr/local/openmpi-1.2.8/intel-11/x86_64/lib
-L/usr/local/openmpi-1.2.8/intel-11/x86_64/lib -lmpi_f90 -lmpi_f77 -lmpi
-lopen-rte -lopen-pal -libverbs -lrt -lnuma -ldl -Wl,--export-dynamic
-lnsl -lutil

-- 
Prentice


Re: [OMPI users] libmpi.so.0 not found during gdb debugging

2011-02-11 Thread Prentice Bisbal
swagat mishra wrote:
> hello everyone,
> i have a network of systems connected over lan with each computer
> running ubuntu. openmpi 1.4.x is installed on 1 machine and the
> installation is mounted on other nodes through Networking File
> System(NFS). the source program and compiled file(a.out) are present in
> the mounted directory
> i run my programs by the following command:
> /opt/project/bin/mpirun -np 4 --prefix  /opt/project/ --hostfile
> hostfile a.out
> i have not set LD_LIBRARY_PATH but as i use --prefix mpirun works
> successfully
>  
> however as per the open mpi debugging faq:
> http://www.open-mpi.org/faq/?category=debugging
> when i run
> /opt/project/bin/mpirun -np 4 --prefix  /opt/project/ --hostfile
> hostfile -x DISPLAY=10.0.0.1:0.0 xterm -e gdb a.out
>  
> 4 xterm windows are opened with gdb running as expected. However, when I
> give the "start" command to gdb in the windows corresponding to remote
> nodes, I get the error:
> libmpi.so.0 not found: no such file/directory
>  
> As mentioned, other MPI jobs run fine with mpirun.
>  
> When I execute
> /opt/project/bin/mpirun -np 4 --prefix  /opt/project/ -x
> DISPLAY=10.0.0.1:0.0 xterm -e gdb a.out , the debugging continues successfully
>  
> please help
> 

You need to set LD_LIBRARY_PATH to include the path to the OpenMPI
libraries. The --prefix option works for OpenMPI only; it has no effect
on other programs. You also need to make sure that the LD_LIBRARY_PATH
variable is correctly passed along to the other OpenMPI programs. For
processes on other hosts, this is usually done by editing your shell's
rc file for non-interactive shells (.bashrc for bash).
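
For example, a minimal sketch based on the /opt/project prefix above (adjust
the lib directory name to whatever your install actually uses):

export PATH=/opt/project/bin:$PATH
export LD_LIBRARY_PATH=/opt/project/lib:$LD_LIBRARY_PATH

placed in ~/.bashrc on every node (or once, if home directories are shared).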

-- 
Prentice



Re: [OMPI users] OpenMPI version syntax?

2011-02-03 Thread Prentice Bisbal
rpm -qi  might give you more detailed information.

If not, as a last resort, you can download and install the SRPM and
then look at the name of the tarball in /usr/src/redhat/SOURCES.
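
Something along these lines (illustrative, and assuming the package is simply
named openmpi) shows the two fields separately:

rpm -qi openmpi | grep -E '^(Version|Release)'

Version is the upstream Open MPI version; Release is the packager's revision
of that version.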

Prentice

Jeffrey A Cummings wrote:
> The context was wrt the OpenMPI version that is bundled with a specific
> version of CentOS Linux which my IT folks are about to install on one of
> our servers.  Since the most recent 1.4 stream version is 1.4.3, I'm
> afraid that 1.4-4 is really some variant of 1.4 (i.e., 1.4.0) and hence
> not that new.
> 
> 
> 
> 
> From: Jeff Squyres
> To: Open MPI Users
> Date: 02/02/2011 07:38 PM
> Subject: Re: [OMPI users] OpenMPI version syntax?
> Sent by: users-boun...@open-mpi.org
> 
> 
> 
> 
> On Feb 2, 2011, at 1:44 PM, Jeffrey A Cummings wrote:
> 
>> I've encountered a supposed OpenMPI version of 1.4-4.  Is the hyphen a
> typo or is this syntax correct and if so what does it mean?
> 
> Is this an RPM version number?  It's fairly common for RPMs to add "-X"
> at the end of the version number.  The "X" indicates the RPM version
> number (i.e., the version number of the packaging -- not the package
> itself).
> 
> Open MPI's version number scheme is explained here:
> 
>http://www.open-mpi.org/software/ompi/versions/
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 


Re: [OMPI users] How closely tied is a specific release of OpenMPI to the host operating system and other system software?

2011-02-02 Thread Prentice Bisbal
Jeffrey A Cummings wrote:
> I use OpenMPI on a variety of platforms:  stand-alone servers running
> Solaris on sparc boxes and Linux (mostly CentOS) on AMD/Intel boxes,
> also Linux (again CentOS) on large clusters of AMD/Intel boxes.  These
> platforms all have some version of the 1.3 OpenMPI stream.  I recently
> requested an upgrade on all systems to 1.4.3 (for production work) and
> 1.5.1 (for experimentation).  I'm getting a lot of push back from the
> SysAdmin folks claiming that OpenMPI is closely intertwined with the
> specific version of the operating system and/or other system software
> (i.e., Rocks on the clusters).  I need to know if they are telling me
> the truth or if they're just making excuses to avoid the work.  To state
> my question another way:  Apparently each release of Linux and/or Rocks
> comes with some version of OpenMPI bundled in.  Is it dangerous in some
> way to upgrade to a newer version of OpenMPI?  Thanks in advance for any
> insight anyone can provide.
> 
> - Jeff
> 

Jeff,

OpenMPI is more or less a user-space program, and isn't that tightly
coupled to the OS at all. As long as the OS has the correct network
drivers (ethernet, IB, or other), that's all OpenMPI needs to do its
job. In fact, you can install it yourself in your own home directory (if
 your home directory is shared amongst the cluster nodes you want to
use), and run it from there - no special privileges needed.
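
A build into a home directory is just the usual sequence, roughly (1.4.3 is
used here only as an example version):

./configure --prefix=$HOME/openmpi-1.4.3
make
make install

followed by putting $HOME/openmpi-1.4.3/bin on your PATH and
$HOME/openmpi-1.4.3/lib on your LD_LIBRARY_PATH.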

I have many different versions of OpenMPI installed on my systems,
without a problem.

As a system administrator responsible for maintaining OpenMPI on several
clusters, it sounds like one of two things:

1. Your system administrators really don't know what they're talking
about, or,

2. They're lying to you to avoid doing work.

--
Prentice


Re: [OMPI users] Method for worker to determine its "rank" on a single machine?

2010-12-10 Thread Prentice Bisbal



On 12/10/2010 07:55 AM, Ralph Castain wrote:

Ick - I agree that's portable, but truly ugly.

Would it make sense to implement this as an MPI extension, and then
perhaps propose something to the Forum for this purpose?


I think that makes sense. As core and socket counts go up, I imagine the 
need for this information will become more common as programmers try to 
explicitly keep codes on a single socket or node.


Prentice



Just hate to see such a complex, time-consuming method when the info is
already available on every process.

On Dec 10, 2010, at 3:36 AM, Terry Dontje wrote:


A more portable way of doing what you want below is to gather each
process's processor_name given by MPI_Get_processor_name, have the
root who gets this data assign unique numbers to each name and then
scatter that info to the processes and have them use that as the color
to a MPI_Comm_split call. Once you've done that you can do a
MPI_Comm_size to find how many are on the node and be able to send to
all the other processes on that node using the new communicator.
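
A rough, untested sketch of that recipe in C (variable names are only
illustrative):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int world_rank, world_size, namelen, i, j;
    int color, local_rank, local_size;
    char name[MPI_MAX_PROCESSOR_NAME];
    char *allnames = NULL;
    int *colors = NULL;
    MPI_Comm node_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    memset(name, 0, sizeof(name));
    MPI_Get_processor_name(name, &namelen);

    if (world_rank == 0) {
        allnames = malloc((size_t)world_size * MPI_MAX_PROCESSOR_NAME);
        colors   = malloc((size_t)world_size * sizeof(int));
    }

    /* root gathers every process's node name */
    MPI_Gather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
               allnames, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
               0, MPI_COMM_WORLD);

    /* root assigns the same color to identical names */
    if (world_rank == 0) {
        for (i = 0; i < world_size; i++) {
            colors[i] = i;
            for (j = 0; j < i; j++) {
                if (strcmp(allnames + i * MPI_MAX_PROCESSOR_NAME,
                           allnames + j * MPI_MAX_PROCESSOR_NAME) == 0) {
                    colors[i] = colors[j];
                    break;
                }
            }
        }
    }

    /* each process gets its color back and splits on it */
    MPI_Scatter(colors, 1, MPI_INT, &color, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &node_comm);

    /* rank and size within the node-local communicator */
    MPI_Comm_rank(node_comm, &local_rank);
    MPI_Comm_size(node_comm, &local_size);
    printf("world rank %d is local rank %d of %d on %s\n",
           world_rank, local_rank, local_size, name);

    MPI_Comm_free(&node_comm);
    if (world_rank == 0) { free(allnames); free(colors); }
    MPI_Finalize();
    return 0;
}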

Good luck,

--td
On 12/09/2010 08:18 PM, Ralph Castain wrote:

The answer is yes - sort of...

In OpenMPI, every process has information about not only its own local rank, 
but the local rank of all its peers regardless of what node they are on. We use 
that info internally for a variety of things.

Now the "sort of". That info isn't exposed via an MPI API at this time. If that 
doesn't matter, then I can tell you how to get it - it's pretty trivial to do.


On Dec 9, 2010, at 6:14 PM, David Mathog wrote:


Is it possible through MPI for a worker to determine:

  1. how many MPI processes are running on the local machine
  2. within that set its own "local rank"

?

For instance, a quad core with 4 processes might be hosting ranks 10,
14, 15, 20, in which case the "local ranks" would be 1,2,3,4.  The idea
being to use this information so that a program could selectively access
different local resources.  Simple example: on this 4 worker machine
reside telephone directories for Los Angeles, San Diego, San Jose, and
Sacramento.  Each worker is to open one database and search it when the
master sends a request.  With the "local rank" number this would be as
easy as naming the databases file1, file2, file3, and file4.  Without it
the 4 processes would have to communicate with each other somehow to
sort out which is to use which database.  And that could get ugly fast,
especially if they don't all start at the same time.

Thanks,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech





Re: [OMPI users] Help!!!!!!!!!!!!Openmpi instal for ubuntu 64 bits

2010-11-30 Thread Prentice Bisbal
Jeff Squyres wrote:
> Please note that this is an English-speaking list.  I don't know if Tim 
> speaks "Spanish", but I unfortunately don't.  :-)
> 

s/Spanish/Portuguese/

-- 
Prentice

