Re: [OMPI users] Mac OS X Static PGI

2011-03-01 Thread Ralph Castain

On Mar 1, 2011, at 1:34 PM, David Robertson wrote:

> Hi,
> 
> > Error means OMPI didn't find a network interface - do you have your
> > networks turned off? Sometimes people travel with Airport turned off.
> > If you have no wire connected, then no interfaces exist.
> 
> I am logged in to the machine remotely through the wired interface. The 
> Airport is always off. I have Open MPI built and running fine with gcc/ifort 
> and gcc/gfortran using shared libraries. I have compiled and run successfully 
> with both shared and static libraries with gcc/ifort. I have not tried the 
> static libraries with gfortran/gcc.
> 
> ifconfig gives me:
> 
> lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
>     inet6 ::1 prefixlen 128
>     inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
>     inet 127.0.0.1 netmask 0xff000000
> gif0: flags=8010<POINTOPOINT,MULTICAST> mtu 1280
> stf0: flags=0<> mtu 1280
> en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>     ether 10:9a:dd:55:bb:52
>     inet6 fe80::129a:ddff:fe55:bb52%en0 prefixlen 64 scopeid 0x4
>     inet 192.168.30.13 netmask 0xffffc000 broadcast 192.168.63.255
>     media: autoselect (1000baseT <full-duplex>)
>     status: active
> fw0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 4078
>     lladdr 70:cd:60:ff:fe:2f:01:8e
>     media: autoselect <full-duplex>
>     status: inactive
> en1: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>     ether c8:bc:c8:c9:fc:a9
>     media: autoselect (<unknown type>)
>     status: inactive
> vnic0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>     ether 00:1c:42:00:00:08
>     inet 10.211.55.2 netmask 0xffffff00 broadcast 10.211.55.255
>     media: autoselect
>     status: active
> vnic1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>     ether 00:1c:42:00:00:09
>     inet 10.37.129.2 netmask 0xffffff00 broadcast 10.37.129.255
>     media: autoselect
>     status: active
> vboxnet0: flags=8842<BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>     ether 0a:00:27:00:00:00
> 
> Are you saying that Open MPI is only looking for the Airport (en1) card and 
> not en0?

No, it isn't. However, what the error message says is as I indicated - it is 
failing because it is getting an error when trying to open a port on an 
available network. I can't debug your network to find out why. I know that Mac 
doesn't really like (nor does Apple really support) static builds, and it has 
been a long time since I have built it that way on my Mac. Looking at my old 
static config file, I don't see anything special in it.

That said, I know we had some early problems with static builds on the Mac 
(like I said, Apple doesn't really support it). Those were solved, though, and 
none of those problems had this symptom.

Could be something strange about PGI and socket libs when running static, but I 
wouldn't know - I don't use PGI.

Sorry I can't be of help - I suggest asking PGI about issues re socket support 
with their compiler on the Mac, or not using PGI if they only support static 
builds given Apple's lack of support for that mode of operation on the Mac 
(seems bizarre that PGI would demand it).
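
One quick sanity check, if you want to dig further: interface discovery at that level is basically an enumeration of the kernel's interface list, so a tiny test built the same way as Open MPI (pgcc, static flags) would tell you whether enumeration itself is broken in that combination. A rough, untested sketch using getifaddrs(), which Mac OS X provides:

/* Untested diagnostic sketch (not Open MPI code): list the AF_INET
 * interfaces visible to this process.  Build it with the same compiler
 * and static flags as Open MPI and compare against ifconfig. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <ifaddrs.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    struct ifaddrs *ifap, *ifa;
    char buf[INET_ADDRSTRLEN];

    if (getifaddrs(&ifap) != 0) {
        perror("getifaddrs");
        return 1;
    }
    for (ifa = ifap; ifa != NULL; ifa = ifa->ifa_next) {
        if (ifa->ifa_addr == NULL || ifa->ifa_addr->sa_family != AF_INET)
            continue;
        struct sockaddr_in *sin = (struct sockaddr_in *) ifa->ifa_addr;
        inet_ntop(AF_INET, &sin->sin_addr, buf, sizeof(buf));
        printf("%-10s %s\n", ifa->ifa_name, buf);
    }
    freeifaddrs(ifap);
    return 0;
}

If that comes up empty under a static PGI build but works when built dynamically, that would point squarely at the toolchain.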


> Why would it do that for PGI only?


It doesn't, nor does it care what compiler is used.

> 
> Thanks,
> Dave
> 
> 
> On Mar 1, 2011, at 11:50 AM, David Robertson wrote:
> 
> > Hi all,
> >
> > I am having trouble with PGI on Mac OS X 10.6.6. PGI's support staff has 
> > informed me that PGI does not "support 64-bit shared library creation" on 
> > the Mac. Therefore, I have built Open MPI in static only mode 
> > (--disable-shared --enable-static).
> >
> > I have to do some manipulation to get my application to pass the final 
> > linking stage (more on that at the bottom) but I get an immediate crash at 
> > runtime:
> >
> >
> >  start of output
> > bash-3.2$ mpirun -np 4 oceanG ocean_upwelling.in
> > [flask.marine.rutgers.edu:14186] opal_ifinit: unable to find network 
> > interfaces.
> > [flask.marine.rutgers.edu:14186] [[65522,0],0] ORTE_ERROR_LOG: Error in 
> > file ess_hnp_module.c at line 181
> > --------------------------------------------------------------------------
> > It looks like orte_init failed for some reason; your parallel process is
> > likely to abort. There are many reasons that a parallel process can
> > fail during orte_init; some of which are due to configuration or
> > environment problems. This failure appears to be an internal failure;
> > here's some additional information (which may only be relevant to an
> > Open MPI developer):
> >
> > orte_rml_base_select failed
> > --> Returned value Error (-1) instead of ORTE_SUCCESS
> > --------------------------------------------------------------------------
> > [flask.marine.rutgers.edu:14186] [[65522,0],0] ORTE_ERROR_LOG: Error in 
> > file runtime/orte_init.c at line 132
> > 

[OMPI users] MPI_ALLREDUCE bug with 1.5.2rc3r24441

2011-03-01 Thread Harald Anlauf
Hi,

There appears to be a regression in revision 1.5.2rc3r24441.
The attached program crashes even with 1 PE with:

 Default real, digits:   4  24
 Real kind,digits:   8  53
 Integer kind,   bits:   8   64
 Default Integer :   4  32
 Sum[real]:   1.000   2.000   3.000
 Sum[real(8)]:   1.000000000000000   2.000000000000000   3.000000000000000
 Sum[integer(4)]:   1   2   3
[proton:24826] *** An error occurred in MPI_Allreduce: the reduction
operation MPI_SUM is not defined on the MPI_INTEGER8 datatype


On the other hand,

% ompi_info --arch
 Configured architecture: i686-pc-linux-gnu
% ompi_info --all |grep 'integer[48]'
  Fort have integer4: yes
  Fort have integer8: yes
  Fort integer4 size: 4
  Fort integer8 size: 8
 Fort integer4 align: 4
 Fort integer8 align: 8

There are no problems with 1.4.x and earlier revisions.
program test
  use mpi
  implicit none
  integer, parameter :: i8 = selected_int_kind  (15)
  integer, parameter :: r8 = selected_real_kind (15,90)
  integer, parameter :: N  = 3

  integer :: i4i(N), i4s(N)
  integer(i8) :: i8i(N), i8s(N)
  real:: r4i(N), r4s(N)
  real(r8):: r8i(N), r8s(N)
  integer :: ierr, nproc, myrank, i

  i4i = (/ (i, i=1,N) /); i8i = (/ (i, i=1,N) /)
  r4i = (/ (i, i=1,N) /); r8i = (/ (i, i=1,N) /)

  call MPI_INIT  (ierr)
  call MPI_COMM_SIZE (MPI_COMM_WORLD, nproc,  ierr)
  call MPI_COMM_RANK (MPI_COMM_WORLD, myrank, ierr)

  if (myrank == 0) then
 print *, "Default real, digits:", kind (1.0), digits (1.0)
 print *, "Real kind,digits:", r8, digits (1._r8)
 print *, "Integer kind,   bits:", i8, bit_size (1_i8)
 print *, "Default Integer :", kind (1), bit_size (1)
  end if

  call MPI_ALLREDUCE (r4i, r4s, N, MPI_REAL, MPI_SUM, MPI_COMM_WORLD, ierr)
  if (myrank == 0)print *, "Sum[real]:", r4s

  call MPI_ALLREDUCE (r8i, r8s, N, MPI_REAL8,MPI_SUM, MPI_COMM_WORLD, ierr)
  if (myrank == 0)print *, "Sum[real(8)]:", r8s

  call MPI_ALLREDUCE (i4i, i4s, N, MPI_INTEGER4, MPI_SUM, MPI_COMM_WORLD, ierr)
  if (myrank == 0)print *, "Sum[integer(4)]:", i4s

  call MPI_ALLREDUCE (i8i, i8s, N, MPI_INTEGER8, MPI_SUM, MPI_COMM_WORLD, ierr)
  if (myrank == 0)print *, "Sum[integer(8)]:", i8s

  call MPI_FINALIZE (ierr)
end program test
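
In case it helps anyone hitting the same abort: a user-defined reduction operation is not routed through the built-in op/datatype table, so it may sidestep the failing MPI_SUM/MPI_INTEGER8 check. An untested sketch in C (the same MPI_Op_create approach is available from Fortran as MPI_OP_CREATE):

/* Untested workaround sketch: sum MPI_INTEGER8 via a user-defined op
 * instead of the built-in MPI_SUM.  Assumes Open MPI exposes
 * MPI_INTEGER8 to C, consistent with the ompi_info output above. */
#include <mpi.h>
#include <stdint.h>
#include <stdio.h>

static void i8_sum(void *invec, void *inoutvec, int *len, MPI_Datatype *dtype)
{
    int64_t *in = (int64_t *) invec;
    int64_t *inout = (int64_t *) inoutvec;
    int i;
    for (i = 0; i < *len; i++)
        inout[i] += in[i];              /* elementwise 64-bit sum */
}

int main(int argc, char **argv)
{
    int64_t in[3] = {1, 2, 3}, out[3];
    MPI_Op op;

    MPI_Init(&argc, &argv);
    MPI_Op_create(i8_sum, 1 /* commutative */, &op);
    MPI_Allreduce(in, out, 3, MPI_INTEGER8, op, MPI_COMM_WORLD);
    printf("Sum[integer(8)]: %lld %lld %lld\n",
           (long long) out[0], (long long) out[1], (long long) out[2]);
    MPI_Op_free(&op);
    MPI_Finalize();
    return 0;
}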


Re: [OMPI users] Mac OS X Static PGI

2011-03-01 Thread David Robertson

Hi,

> Error means OMPI didn't find a network interface - do you have your
> networks turned off? Sometimes people travel with Airport turned off.
> If you have no wire connected, then no interfaces exist.

I am logged in to the machine remotely through the wired interface. The 
Airport is always off. I have Open MPI built and running fine with 
gcc/ifort and gcc/gfortran using shared libraries. I have compiled and 
run successfully with both shared and static libraries with gcc/ifort. I 
have not tried the static libraries with gfortran/gcc.


ifconfig gives me:

lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
    inet6 ::1 prefixlen 128
    inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
    inet 127.0.0.1 netmask 0xff000000
gif0: flags=8010<POINTOPOINT,MULTICAST> mtu 1280
stf0: flags=0<> mtu 1280
en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
    ether 10:9a:dd:55:bb:52
    inet6 fe80::129a:ddff:fe55:bb52%en0 prefixlen 64 scopeid 0x4
    inet 192.168.30.13 netmask 0xffffc000 broadcast 192.168.63.255
    media: autoselect (1000baseT <full-duplex>)
    status: active
fw0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 4078
    lladdr 70:cd:60:ff:fe:2f:01:8e
    media: autoselect <full-duplex>
    status: inactive
en1: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
    ether c8:bc:c8:c9:fc:a9
    media: autoselect (<unknown type>)
    status: inactive
vnic0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
    ether 00:1c:42:00:00:08
    inet 10.211.55.2 netmask 0xffffff00 broadcast 10.211.55.255
    media: autoselect
    status: active
vnic1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
    ether 00:1c:42:00:00:09
    inet 10.37.129.2 netmask 0xffffff00 broadcast 10.37.129.255
    media: autoselect
    status: active
vboxnet0: flags=8842<BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
    ether 0a:00:27:00:00:00

Are you saying that Open MPI is only looking for the Airport (en1) card 
and not en0? Why would it do that for PGI only?


Thanks,
Dave


On Mar 1, 2011, at 11:50 AM, David Robertson wrote:

> Hi all,
>
> I am having trouble with PGI on Mac OS X 10.6.6. PGI's support staff
> has informed me that PGI does not "support 64-bit shared library
> creation" on the Mac. Therefore, I have built Open MPI in static only
> mode (--disable-shared --enable-static).
>
> I have to do some manipulation to get my application to pass the
> final linking stage (more on that at the bottom) but I get an immediate
> crash at runtime:
>
>  start of output
> bash-3.2$ mpirun -np 4 oceanG ocean_upwelling.in
> [flask.marine.rutgers.edu:14186] opal_ifinit: unable to find network
> interfaces.
> [flask.marine.rutgers.edu:14186] [[65522,0],0] ORTE_ERROR_LOG: Error
> in file ess_hnp_module.c at line 181
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   orte_rml_base_select failed
>   --> Returned value Error (-1) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> [flask.marine.rutgers.edu:14186] [[65522,0],0] ORTE_ERROR_LOG: Error
> in file runtime/orte_init.c at line 132
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   orte_ess_set_name failed
>   --> Returned value Error (-1) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> [flask.marine.rutgers.edu:14186] [[65522,0],0] ORTE_ERROR_LOG: Error
> in file orterun.c at line 543
>  end of output
>
> When I google for this error the only result I find is for a patch to
> version 1.1.2 which doesn't even resemble the current state of the Open
> MPI code.
>
> iMac info:
>
> ProductName:    Mac OS X
> ProductVersion: 10.6.6
> BuildVersion:   10J567
>
> Has anyone seen this before or have an idea what to try?
>
> Thanks,
> Dave
>
> P.S. I get the same results with Open MPI configured with:
>
> ./configure --prefix=/opt/pgisoft/openmpi/openmpi-1.4.3 CC=pgcc
> CXX=pgcpp F77=pgf77 FC=pgf90 --enable-mpirun-prefix-by-default
> --disable-shared --enable-static --without-memory-manager

Re: [OMPI users] Mac OS X Static PGI

2011-03-01 Thread Ralph Castain
Error means OMPI didn't find a network interface - do you have your networks 
turned off? Sometimes people travel with Airport turned off. If you have no wire 
connected, then no interfaces exist.

Sent from my iPad

On Mar 1, 2011, at 11:50 AM, David Robertson wrote:

> Hi all,
> 
> I am having trouble with PGI on Mac OS X 10.6.6. PGI's support staff has 
> informed me that PGI does not "support 64-bit shared library creation" on the 
> Mac. Therefore, I have built Open MPI in static only mode (--disable-shared 
> --enable-static).
> 
> I have to do some manipulation to get my application to pass the final 
> linking stage (more on that at the bottom) but I get an immediate crash at 
> runtime:
> 
> 
>  start of output
> bash-3.2$ mpirun -np 4 oceanG ocean_upwelling.in
> [flask.marine.rutgers.edu:14186] opal_ifinit: unable to find network 
> interfaces.
> [flask.marine.rutgers.edu:14186] [[65522,0],0] ORTE_ERROR_LOG: Error in file 
> ess_hnp_module.c at line 181
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> 
>  orte_rml_base_select failed
>  --> Returned value Error (-1) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> [flask.marine.rutgers.edu:14186] [[65522,0],0] ORTE_ERROR_LOG: Error in file 
> runtime/orte_init.c at line 132
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> 
>  orte_ess_set_name failed
>  --> Returned value Error (-1) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> [flask.marine.rutgers.edu:14186] [[65522,0],0] ORTE_ERROR_LOG: Error in file 
> orterun.c at line 543
>  end of output
> 
> 
> When I google for this error the only result I find is for a patch to version 
> 1.1.2 which doesn't even resemble the current state of the Open MPI code.
> 
> iMac info:
> 
> ProductName:Mac OS X
> ProductVersion: 10.6.6
> BuildVersion:   10J567
> 
> Has anyone seen this before or have an idea what to try?
> 
> Thanks,
> Dave
> 
> P.S. I get the same results with Open MPI configured with:
> 
> ./configure --prefix=/opt/pgisoft/openmpi/openmpi-1.4.3 CC=pgcc CXX=pgcpp 
> F77=pgf77 FC=pgf90 --enable-mpirun-prefix-by-default --disable-shared 
> --enable-static --without-memory-manager --without-libnuma --disable-ipv6 
> --disable-io-romio --disable-heterogeneous --enable-mpi-f77 --enable-mpi-f90 
> --enable-mpi-profile
> 
> and
> 
> ./configure --prefix=/opt/pgisoft/openmpi/openmpi-1.4.3 CC=pgcc CXX=pgcpp 
> F77=pgf77 FC=pgf90 --disable-shared --enable-static
> 
> 
> 
> P.P.S. Linking workarounds:
> 
> Snow Leopard ships with Open MPI libraries that interfere when linking 
> programs built with my compiled mpif90. The problem is that 'ld' searches 
> every directory in the search path for shared objects before it will look for 
> static archives. That means a line like:
> 
> pgf90 x.o -o a.out -L/opt/openmpi/lib -lmpi_f90 -lmpi_f77 -lmpi
> 
> will use the .a file in /opt/openmpi/lib because Snow Leopard doesn't ship 
> with Fortran bindings but when it gets to -lmpi it picks up the libmpi.dylib 
> from /usr/lib and causes undefined references. Note the line above is 
> inferred using the -showme:link option to mpif90.
> 
> I have found two workarounds to this. Edit the 
> share/openmpi/mpif90-wrapper-data.txt file to have full paths to the static 
> libraries (this is what the PGI shipped version of Open MPI does). The other 
> option is to add the line:
> 
> switch -search_paths_first is replace(-search_paths_first) positional(linker);
> 
> to the /path/to/pgi/bin/siterc file and set LDFLAGS to -search_paths_first in 
> my application.
> 
> from the ld manpage:
> 
> -search_paths_first
>  By default the -lx and -weak-lx options first search for a file
>  of the form `libx.dylib' in each directory in the library search
>  path, then a file of the form `libx.a' is searched for in the
>  library search paths.  This option changes it so that in each
>  path `libx.dylib' is searched for then `libx.a' before the next
>  path in the library search path is searched.
> 

[OMPI users] Mac OS X Static PGI

2011-03-01 Thread David Robertson

Hi all,

I am having trouble with PGI on Mac OS X 10.6.6. PGI's support staff has 
informed me that PGI does not "support 64-bit shared library creation" 
on the Mac. Therefore, I have built Open MPI in static only mode 
(--disable-shared --enable-static).


I have to do some manipulation to get my application to pass the final 
linking stage (more on that at the bottom) but I get an immediate crash 
at runtime:



 start of output
bash-3.2$ mpirun -np 4 oceanG ocean_upwelling.in
[flask.marine.rutgers.edu:14186] opal_ifinit: unable to find network 
interfaces.
[flask.marine.rutgers.edu:14186] [[65522,0],0] ORTE_ERROR_LOG: Error in 
file ess_hnp_module.c at line 181
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_rml_base_select failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[flask.marine.rutgers.edu:14186] [[65522,0],0] ORTE_ERROR_LOG: Error in 
file runtime/orte_init.c at line 132
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_set_name failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[flask.marine.rutgers.edu:14186] [[65522,0],0] ORTE_ERROR_LOG: Error in 
file orterun.c at line 543

 end of output


When I google for this error the only result I find is for a patch to 
version 1.1.2 which doesn't even resemble the current state of the Open 
MPI code.


iMac info:

ProductName:Mac OS X
ProductVersion: 10.6.6
BuildVersion:   10J567

Has anyone seen this before or have an idea what to try?

Thanks,
Dave

P.S. I get the same results with Open MPI configured with:

./configure --prefix=/opt/pgisoft/openmpi/openmpi-1.4.3 CC=pgcc 
CXX=pgcpp F77=pgf77 FC=pgf90 --enable-mpirun-prefix-by-default 
--disable-shared --enable-static --without-memory-manager 
--without-libnuma --disable-ipv6 --disable-io-romio 
--disable-heterogeneous --enable-mpi-f77 --enable-mpi-f90 
--enable-mpi-profile


and

./configure --prefix=/opt/pgisoft/openmpi/openmpi-1.4.3 CC=pgcc 
CXX=pgcpp F77=pgf77 FC=pgf90 --disable-shared --enable-static




P.P.S. Linking workarounds:

Snow Leopard ships with Open MPI libraries that interfere when linking 
programs built with my compiled mpif90. The problem is that 'ld' 
searches every directory in the search path for shared objects before it 
will look for static archives. That means a line like:


pgf90 x.o -o a.out -L/opt/openmpi/lib -lmpi_f90 -lmpi_f77 -lmpi

will use the .a file in /opt/openmpi/lib because Snow Leopard doesn't 
ship with Fortran bindings but when it gets to -lmpi it picks up the 
libmpi.dylib from /usr/lib and causes undefined references. Note the 
line above is inferred using the -showme:link option to mpif90.


I have found two workarounds to this. Edit the 
share/openmpi/mpif90-wrapper-data.txt file to have full paths to the 
static libraries (this is what the PGI shipped version of Open MPI 
does). The other option is to add the line:


switch -search_paths_first is replace(-search_paths_first) 
positional(linker);


to the /path/to/pgi/bin/siterc file and set LDFLAGS to 
-search_paths_first in my application.


from the ld manpage:

-search_paths_first
  By default the -lx and -weak-lx options first search for a file
  of the form `libx.dylib' in each directory in the library search
  path, then a file of the form `libx.a' is searched for in the
  library search paths.  This option changes it so that in each
  path `libx.dylib' is searched for then `libx.a' before the next
  path in the library search path is searched.
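
To confirm which flavor actually got linked, otool -L lists the dylibs a 
Mach-O binary depends on; after relinking with -search_paths_first (or with 
the full .a paths) the output should show system libraries but no 
libmpi*.dylib:

bash-3.2$ otool -L a.out    # expect libSystem and friends, but no libmpi*.dylib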


Re: [OMPI users] RDMACM Differences

2011-03-01 Thread Jeff Squyres
On Feb 28, 2011, at 12:49 PM, Jagga Soorma wrote:

> -bash-3.2$ mpiexec --mca btl openib,self -mca btl_openib_warn_default_gid_prefix 0 -np 2 --hostfile mpihosts /home/jagga/osu-micro-benchmarks-3.3/openmpi/ofed-1.5.2/bin/osu_latency

Your use of btl_openib_warn_default_gid_prefix may have brought up a subtle 
issue in Open MPI's verbs support.  More below.

> # OSU MPI Latency Test v3.3
> # SizeLatency (us)
> [amber04][[10252,1],1][connect/btl_openib_connect_oob.c:325:qp_connect_all] 
> error modifing QP to RTR errno says Invalid argument
> [amber04][[10252,1],1][connect/btl_openib_connect_oob.c:815:rml_recv_cb] 
> error in endpoint reply start connect

Looking at this error message and your ibv_devinfo output:

> [root@amber03 ~]# ibv_devinfo
> hca_id: mlx4_0
>         transport:          InfiniBand (0)
>         fw_ver:             2.7.9294
>         node_guid:          78e7:d103:0021:8884
>         sys_image_guid:     78e7:d103:0021:8887
>         vendor_id:          0x02c9
>         vendor_part_id:     26438
>         hw_ver:             0xB0
>         board_id:           HP_020003
>         phys_port_cnt:      2
>                 port:   1
>                         state:          PORT_ACTIVE (4)
>                         max_mtu:        2048 (4)
>                         active_mtu:     2048 (4)
>                         sm_lid:         1
>                         port_lid:       20
>                         port_lmc:       0x00
>                         link_layer:     IB
> 
>                 port:   2
>                         state:          PORT_ACTIVE (4)
>                         max_mtu:        2048 (4)
>                         active_mtu:     1024 (3)
>                         sm_lid:         0
>                         port_lid:       0
>                         port_lmc:       0x00
>                         link_layer:     Ethernet

It looks like you have 1 HCA port as IB and the other at Ethernet.

I'm wondering if OMPI is not taking the device transport into account and is 
*only* using the subnet ID to determine reachability (i.e., I'm wondering if we 
didn't anticipate multiple devices/ports with the same subnet ID but with 
different transports).  I pointed this out to Mellanox yesterday; I think 
they're following up on it.

In the meantime, a workaround might be to set a non-default subnet ID on your 
IB network.  That should allow Open MPI to tell these networks apart without 
additional help.
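
(If the fabric runs OpenSM, I believe the knob is subnet_prefix in the OpenSM 
config file; the value below is only an arbitrary example, so check your 
subnet manager's documentation:

# opensm.conf (sketch) - any value other than the default 0xfe80000000000000
subnet_prefix 0xfe80000000000001
)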

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Unknown overhead in "mpirun -am ft-enable-cr"

2011-03-01 Thread Joshua Hursey
I have not had the time to look into the performance problem yet, and probably 
won't for a little while. Can you send me a small program that illustrates the 
performance problem? I'll file a bug so we don't lose track of it.
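
Something as small as this would do - an untested sketch (not your code, 
obviously) that times an MPI_Isend/MPI_Irecv/MPI_Wait loop, to run once under 
plain mpirun and once with "-am ft-enable-cr":

/* Untested sketch: time a nonblocking ping-pong loop between rank pairs
 * to expose per-message overhead added by the C/R wrapper. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    const int iters = 10000, count = 1024;
    double *sbuf, *rbuf, t0, t1;
    MPI_Request req[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sbuf = malloc(count * sizeof(double));
    rbuf = malloc(count * sizeof(double));
    for (i = 0; i < count; i++) sbuf[i] = i;

    int peer = rank ^ 1;               /* pair up ranks: 0<->1, 2<->3, ... */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    if (peer < size) {
        for (i = 0; i < iters; i++) {
            MPI_Irecv(rbuf, count, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[0]);
            MPI_Isend(sbuf, count, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[1]);
            MPI_Wait(&req[0], MPI_STATUS_IGNORE);
            MPI_Wait(&req[1], MPI_STATUS_IGNORE);
        }
    }
    t1 = MPI_Wtime();
    if (rank == 0)
        printf("%d iterations in %f s (%.2f us/iter)\n",
               iters, t1 - t0, 1e6 * (t1 - t0) / iters);

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}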

Thanks,
Josh

On Feb 25, 2011, at 1:31 PM, Nguyen Toan wrote:

> Dear Josh,
> 
> Did you find out the problem? I still cannot progress anything.
> Hope to hear some good news from you.
> 
> Regards,
> Nguyen Toan
> 
> On Sun, Feb 13, 2011 at 3:04 PM, Nguyen Toan wrote:
> Hi Josh,
> 
> I tried the MCA parameter you mentioned but it did not help, the unknown 
> overhead still exists.
> Here I attach the output of 'ompi_info', both version 1.5 and 1.5.1.
> Hope you can find out the problem.
> Thank you.
> 
> Regards,
> Nguyen Toan
> 
> On Wed, Feb 9, 2011 at 11:08 PM, Joshua Hursey wrote:
> It looks like the logic in the configure script is turning on the FT thread 
> for you when you specify both '--with-ft=cr' and '--enable-mpi-threads'.
> 
> Can you send me the output of 'ompi_info'? Can you also try the MCA parameter 
> that I mentioned earlier to see if that changes the performance?
> 
> If there are many non-blocking sends and receives, there might be a performance 
> bug with the way the point-to-point wrapper is tracking request objects. If 
> the above MCA parameter does not help the situation, let me know and I might 
> be able to take a look at this next week.
> 
> Thanks,
> Josh
> 
> On Feb 9, 2011, at 1:40 AM, Nguyen Toan wrote:
> 
> > Hi Josh,
> > Thanks for the reply. I did not use the '--enable-ft-thread' option. Here 
> > is my build options:
> >
> > CFLAGS=-g \
> > ./configure \
> > --with-ft=cr \
> > --enable-mpi-threads \
> > --with-blcr=/home/nguyen/opt/blcr \
> > --with-blcr-libdir=/home/nguyen/opt/blcr/lib \
> > --prefix=/home/nguyen/opt/openmpi \
> > --with-openib \
> > --enable-mpirun-prefix-by-default
> >
> > My application requires lots of communication in every loop, focusing on 
> > MPI_Isend, MPI_Irecv and MPI_Wait. Also I want to make only one checkpoint 
> > per application execution for my purpose, but the unknown overhead exists 
> > even when no checkpoint was taken.
> >
> > Do you have any other idea?
> >
> > Regards,
> > Nguyen Toan
> >
> >
> > On Wed, Feb 9, 2011 at 12:41 AM, Joshua Hursey wrote:
> > There are a few reasons why this might be occurring. Did you build with the 
> > '--enable-ft-thread' option?
> >
> > If so, it looks like I didn't move over the thread_sleep_wait adjustment 
> > from the trunk - the thread was being a bit too aggressive. Try adding the 
> > following to your command line options, and see if it changes the 
> > performance.
> >  "-mca opal_cr_thread_sleep_wait 1000"
> >
> > There are other places to look as well depending on how frequently your 
> > application communicates, how often you checkpoint, process layout, ... But 
> > usually the aggressive nature of the thread is the main problem.
> >
> > Let me know if that helps.
> >
> > -- Josh
> >
> > On Feb 8, 2011, at 2:50 AM, Nguyen Toan wrote:
> >
> > > Hi all,
> > >
> > > I am using the latest version of OpenMPI (1.5.1) and BLCR (0.8.2).
> > > I found that when running an application, which uses MPI_Isend, MPI_Irecv 
> > > and MPI_Wait,
> > > enabling C/R, i.e using "-am ft-enable-cr", the application runtime is 
> > > much longer than the normal execution with mpirun (no checkpoint was 
> > > taken).
> > > This overhead becomes larger when the normal execution runtime is longer.
> > > Does anybody have any idea about this overhead, and how to eliminate it?
> > > Thanks.
> > >
> > > Regards,
> > > Nguyen
> > > ___
> > > users mailing list
> > > us...@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> > 
> > Joshua Hursey
> > Postdoctoral Research Associate
> > Oak Ridge National Laboratory
> > http://users.nccs.gov/~jjhursey
> >
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey




Re: [OMPI users] Basic question on portability

2011-03-01 Thread Jeff Squyres
Yes, you will have problems.

We did not formally introduce ABI compatibility until version 1.3.2.  Meaning: 
your application compiled with 1.3.2 will successfully link/run against any 
1.3.x version >= 1.3.2, and against any 1.4.x version.

v1.5 broke ABI with the v1.3/v1.4 series, but it will also be stable for the 
duration of the v1.5/v1.6 series.  

We have no definite plans yet for v1.7, but it is likely that the ABI story 
will be the same there, too -- break from v1.5/v1.6 and stable for v1.7/v1.8.
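
If you are not sure which version a binary was compiled against, the mpi.h it 
saw records this; a trivial check (these OMPI_*_VERSION macros are Open 
MPI-specific, so this sketch applies to Open MPI only):

/* Sketch: print the Open MPI version recorded in the mpi.h used at
 * compile time (OMPI_*_VERSION are Open MPI-specific macros). */
#include <stdio.h>
#include <mpi.h>

int main(void)
{
    printf("compiled against Open MPI %d.%d.%d\n",
           OMPI_MAJOR_VERSION, OMPI_MINOR_VERSION, OMPI_RELEASE_VERSION);
    return 0;
}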


On Mar 1, 2011, at 11:25 AM, Blosch, Edwin L wrote:

> If I compile against OpenMPI 1.2.8, shared linkage, on one system, then move 
> the executable to another system with OpenMPI 1.4.x or 1.5.x, will I have any 
> problems running the executable?
>  
> Thanks
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] using MPI through Qt

2011-03-01 Thread Eugene Loh

Eye RCS 51 wrote:


> Hi,
>
> In an effort to make a Qt gui using MPI, I have the following:
>
> 1. Gui started in master node.
>
> 2. In Gui, through a pushbutton, a global variable x is assigned some
> value; let's say, x=1000;
>
> 3. I want this value to be known to all nodes. So I used broadcast in
> the function assigning it on the master node and all other nodes.
>
> 4. I printed values of x, which prints all 1000 in all nodes.
>
> 5. Now control has reached MPI_Finalize in all nodes except master.
>
> Now if I want to reassign the value of x using a pushbutton in the master
> node and again broadcast to and print in all nodes, can it be done?

Not with MPI if MPI_Finalize has been called.

> I mean, can I have an MPI function which through the GUI is called many
> times and assigns and prints WHILE the program is running.

You can call an MPI function like MPI_Bcast many times.  E.g.,

MPI_Init(...);
MPI_Comm_rank(..., &myrank);
while (...) {
  if ( myrank == MASTER ) x = ...;
  MPI_Bcast(&x, ...);
}
MPI_Finalize();

There are many helpful MPI tutorials that can be found on the internet.

> OR simply can I have a print function which prints the noderank value
> in all nodes whenever the pushbutton is pressed while the program is running.
>
> The command I used is "mpirun -np 3 ./a.out".




[OMPI users] Basic question on portability

2011-03-01 Thread Blosch, Edwin L
If I compile against OpenMPI 1.2.8, shared linkage, on one system, then move 
the executable to another system with OpenMPI 1.4.x or 1.5.x, will I have any 
problems running the executable?

Thanks


Re: [OMPI users] using MPI through Qt

2011-03-01 Thread David Zhang
Certainly you may call MPI functions many times; the problem is that you
need to have matching receives (or collectives) at your slave nodes, which
is only determined at run time. Perhaps this could be done with two
communications: the first broadcasts the type of communication to the slaves
(for example, 1 for collective broadcast, 2 for scatter, etc.); you encode
whatever you wish in an integer. Once the slaves receive the code they will
respond accordingly, posting the corresponding MPI receive. Clearly, a
way to allow the slaves to exit the while loop is needed if you want the
slaves to exit cleanly; the exit code can also be encoded in the integer you
send out.
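
A rough sketch of that pattern (untested; the command codes and the scripted
"button presses" standing in for real GUI events are made up for illustration):

/* Untested sketch of the command-dispatch idea: rank 0 broadcasts a
 * (command, payload) pair per GUI event; all ranks loop until CMD_EXIT.
 * The script[] array stands in for real pushbutton events. */
#include <mpi.h>
#include <stdio.h>

enum { CMD_SET_X = 1, CMD_EXIT = 2 };   /* made-up command codes */

int main(int argc, char **argv)
{
    int rank, x = 0, step = 0;
    int script[3][2] = { {CMD_SET_X, 1000}, {CMD_SET_X, 42}, {CMD_EXIT, 0} };

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    while (1) {
        int msg[2];                     /* msg[0] = command, msg[1] = payload */
        if (rank == 0) {
            msg[0] = script[step][0];   /* in a real GUI: read the event */
            msg[1] = script[step][1];
            step++;
        }
        MPI_Bcast(msg, 2, MPI_INT, 0, MPI_COMM_WORLD);
        if (msg[0] == CMD_EXIT)
            break;
        if (msg[0] == CMD_SET_X) {
            x = msg[1];
            printf("rank %d: x = %d\n", rank, x);
        }
    }
    MPI_Finalize();
    return 0;
}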

On Tue, Mar 1, 2011 at 12:39 AM, Eye RCS 51 wrote:

> Hi,
>
> In an effort to make a Qt gui using MPI, I have the following:
>
> 1. Gui started in master node.
>
> 2. In Gui, through a pushbutton, a global variable x is assigned some
> value; let's say, x=1000;
>
> 3. I want this value to be known to all nodes. So I used broadcast in the
> function assigning it on the master node and all other nodes.
>
> 4. I printed values of x, which prints all 1000 in all nodes.
>
> 5. Now control has reached MPI_Finalize in all nodes except master.
>
> Now if I want to reassign the value of x using a pushbutton in the master
> node and again broadcast to and print in all nodes, can it be done?
> I mean, can I have an MPI function which through the GUI is called many times
> and assigns and prints WHILE the program is running.
>
> OR simply can I have a print function which prints the noderank value in
> all nodes whenever the pushbutton is pressed while the program is running.
>
> The command I used is "mpirun -np 3 ./a.out".
>
> Any help will be appreciated.
> Thank you very much.
>
> --
> eye51
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
David Zhang
University of California, San Diego


Re: [OMPI users] RoCE (IBoE) & OpenMPI

2011-03-01 Thread Jeff Squyres
I thought you mentioned in a prior email that you had gotten one or two other 
OFED sample applications to work properly.  How are they setting the SL?  Are 
they not using the RDMA CM?


On Mar 1, 2011, at 7:35 AM, Michael Shuey wrote:

> So, since RoCE has no SM, and setting an SL is required to get
> lossless ethernet on Cisco switches (and possibly others), does this
> mean that RoCE will never work correctly with OpenMPI on Cisco
> hardware?
> 
> --
> Mike Shuey
> 
> 
> 
On Tue, Mar 1, 2011 at 3:42 AM, Doron Shoham wrote:
>> Hi,
>> 
>> Regarding using a specific SL with RDMA CM, I've checked in the code and 
>> it seems that RDMA_CM uses the SL from the SA.
>> So if you want to configure a specific SL, you need to do it via the SM.
>> 
>> Doron
>> 
>> -Original Message-
>> From: Jeff Squyres [mailto:jsquy...@cisco.com]
>> Sent: Thursday, February 24, 2011 3:45 PM
>> To: Michael Shuey
>> Cc: Open MPI Users, Mike Dubman
>> Subject: Re: [OMPI users] RoCE (IBoE) & OpenMPI
>> 
>> On Feb 24, 2011, at 8:00 AM, Michael Shuey wrote:
>> 
>>> Late yesterday I did have a chance to test the patch Jeff provided
>>> (against 1.4.3 - testing 1.5.x is on the docket for today).  While it
>>> works, in that I can specify a gid_index,
>> 
>> Great!  I'll commit that to the trunk and start the process of moving it to 
>> the v1.5.x series (I know you haven't tested it yet, but it's essentially 
>> the same patch, just slightly adjusted for each of the 3 branches).
>> 
>>> it doesn't do everything
>>> required - my traffic won't match a lossless CoS on the ethernet
>>> switch.  Specifying a GID is only half of it; I really need to also
>>> specify a service level.
>> 
>> RoCE requires the use of the RDMA CM (I think?), and I didn't think there 
>> was a way to request a specific SL via the RDMA CM...?  (I could certainly 
>> be wrong here)
>> 
>> I think Mellanox will need to follow up with these questions...
>> 
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] RoCE (IBoE) & OpenMPI

2011-03-01 Thread Michael Shuey
So, since RoCE has no SM, and setting an SL is required to get
lossless ethernet on Cisco switches (and possibly others), does this
mean that RoCE will never work correctly with OpenMPI on Cisco
hardware?

--
Mike Shuey



On Tue, Mar 1, 2011 at 3:42 AM, Doron Shoham wrote:
> Hi,
>
> Regarding using a specific SL with RDMA CM, I've checked in the code and 
> it seems that RDMA_CM uses the SL from the SA.
> So if you want to configure a specific SL, you need to do it via the SM.
>
> Doron
>
> -Original Message-
> From: Jeff Squyres [mailto:jsquy...@cisco.com]
> Sent: Thursday, February 24, 2011 3:45 PM
> To: Michael Shuey
> Cc: Open MPI Users, Mike Dubman
> Subject: Re: [OMPI users] RoCE (IBoE) & OpenMPI
>
> On Feb 24, 2011, at 8:00 AM, Michael Shuey wrote:
>
>> Late yesterday I did have a chance to test the patch Jeff provided
>> (against 1.4.3 - testing 1.5.x is on the docket for today).  While it
>> works, in that I can specify a gid_index,
>
> Great!  I'll commit that to the trunk and start the process of moving it to 
> the v1.5.x series (I know you haven't tested it yet, but it's essentially the 
> same patch, just slightly adjusted for each of the 3 branches).
>
>> it doesn't do everything
>> required - my traffic won't match a lossless CoS on the ethernet
>> switch.  Specifying a GID is only half of it; I really need to also
>> specify a service level.
>
> RoCE requires the use of the RDMA CM (I think?), and I didn't think there was 
> a way to request a specific SL via the RDMA CM...?  (I could certainly be 
> wrong here)
>
> I think Mellanox will need to follow up with these questions...
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



[OMPI users] using MPI through Qt

2011-03-01 Thread Eye RCS 51
Hi,

In an effort to make a Qt gui using MPI, I have the following:

1. Gui started in master node.

2. In Gui, through a pushbutton, a global variable x is assigned some value;
let's say, x=1000;

3. I want this value to be known to all nodes. So I used broadcast in the
function assigning it on the master node and all other nodes.

4. I printed values of x, which prints all 1000 in all nodes.

5. Now control has reached MPI_Finalize in all nodes except master.

Now if I want to reassign the value of x using a pushbutton in the master node
and again broadcast to and print in all nodes, can it be done?
I mean, can I have an MPI function which through the GUI is called many times
and assigns and prints WHILE the program is running.

OR simply can I have a print function which prints the noderank value in
all nodes whenever the pushbutton is pressed while the program is running.

The command I used is "mpirun -np 3 ./a.out".

Any help will be appreciated.
Thank you very much.

--
eye51