Re: [OMPI users] problem with mca_pml_ob1.so in openmpi-1.8.2rc2

2014-07-25 Thread Siegmar Gross
Hi,

> Can you try adding the
> 
> #include <alloca.h>
> 
> to pml_ob1_isend.c
> 
> And see if that resolves the issue?

Unfortunately it doesn't resolve the problem. I still get the
same error messages.


Kind regards

Siegmar


> 
> -Nathan
> 
> On Fri, Jul 25, 2014 at 07:59:21AM +0200, Siegmar Gross wrote:
> > Hi,
> > 
> > today I tried to track down the error which I reported for
> > my small program (running on Solaris 10 Sparc).
> > 
> > tyr hello_1 121 mpiexec -np 2 a.out 
> > Process 1 of 2 running on tyr.informatik.hs-fulda.de
> > Process 0 of 2 running on tyr.informatik.hs-fulda.de
> > Now 1 slave tasks are sending greetings.
> > ld.so.1: a.out: fatal: relocation error:
> >   file /usr/local/openmpi-1.8.2_64_cc/lib64/openmpi/mca_pml_ob1.so:
> >   symbol alloca: referenced symbol not found
> > ...
> > 
> > 
> > "alloca" is available.
> > 
> > tyr hello_1 122 more x.c 
> > #include <alloca.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > 
> > int main (void)
> > {
> >   int *alloca_buffer;
> >   alloca_buffer = (int *) alloca (sizeof (int));
> >   *alloca_buffer = 1234;
> >   printf ("value: %d\n", *alloca_buffer);
> >   return EXIT_SUCCESS;
> > }
> > tyr hello_1 123 cc x.c 
> > tyr hello_1 124 a.out
> > value: 1234
> > tyr hello_1 125 
> > 
> > 
> > I get the following output if I run my original program in "dbx".
> > 
> > ...
> > RTC: Running program...
> > Write to unallocated (wua) on thread 1:
> > Attempting to write 1 byte at address 0x79f04000
> > t@1 (l@1) stopped in _readdir at 0x56574da0
> > 0x56574da0: _readdir+0x0064:call 
> > _PROCEDURE_LINKAGE_TABLE_+0x2380 [PLT] ! 0x56742a80
> > Current function is find_dyn_components
> >   397           if (0 != lt_dlforeachfile(dir, save_filename, NULL)) {
> > (dbx) 
> > 
> > 
> > Hopefully the above output helps to fix the error. Can I provide
> > anything else? Thank you very much for any help in advance.
> > 
> > 
> > Kind regards
> > 
> > Siegmar
> > 
> > ___
> > users mailing list
> > us...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/07/24868.php



Re: [OMPI users] SIGSEGV for Java program in openmpi-1.8.2rc2 on Solaris 10

2014-07-25 Thread Siegmar Gross
Hi Oscar,

> I'm sorry, but I cannot reproduce the problem.
> I recompiled everything from scratch using Java 8, and it works OK on Debian 7.5.

I have no problem on Linux, but I do have one on Solaris 10. If I use
the gcc-4.9.0 version, everything works fine; if I use Sun C 5.12,
I get the warning that I mentioned in another email.

Kind regards

Siegmar


> On 25/07/14 18:28, Saliya Ekanayake wrote:
> > I too have encountered this, as mentioned in one of my previous emails 
> > (http://comments.gmane.org/gmane.comp.clustering.open-mpi.user/21000). 
> > I've run many tests of our algorithms with the 1.8.1 version, and it 
> > didn't have this problem, but I'm not sure about 1.8.2.
> >
> > Thank you,
> > saliya
> >
> >
> > On Fri, Jul 25, 2014 at 11:56 AM, Jeff Squyres (jsquyres)
> > <jsquy...@cisco.com> wrote:
> >
> > That's quite odd that it only happens for Java programs -- it
> > should happen for *all* programs, based on the stack trace you've
> > shown.
> >
> > Can you print the value of the lds struct where the error occurs?
> >
> >
> > On Jul 25, 2014, at 2:29 AM, Siegmar Gross
> > <siegmar.gr...@informatik.hs-fulda.de> wrote:
> >
> > > Hi,
> > >
> > > I have installed openmpi-1.8.2rc2 with Sun C 5.12 on Solaris
> > > 10 Sparc and x86_64, and I receive a segmentation fault if I
> > > run a small Java program.
> > >
> > > rs0 java 105 mpiexec -np 1 java InitFinalizeMain
> > > #
> > > # A fatal error has been detected by the Java Runtime Environment:
> > > #
> > > #  SIGSEGV (0xb) at pc=0x7ea3c830, pid=18363, tid=2
> > > #
> > > # JRE version: Java(TM) SE Runtime Environment (8.0-b132) (build
> > 1.8.0-b132)
> > > # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b70 mixed
> > mode solaris-sparc
> > > compressed oops)
> > > # Problematic frame:
> > > # C  [libc.so.1+0x3c830]  strlen+0x50
> > > ...
> > >
> > >
> > > I get the following output if I run the program in "dbx".
> > >
> > > ...
> > > RTC: Running program...
> > > Write to unallocated (wua) on thread 1:
> > > Attempting to write 1 byte at address 0x79f04000
> > > t@1 (l@1) stopped in _readdir at 0x56574da0
> > > 0x56574da0: _readdir+0x0064:call
> > > _PROCEDURE_LINKAGE_TABLE_+0x2380 [PLT] ! 0x56742a80
> > > Current function is find_dyn_components
> > >  397           if (0 != lt_dlforeachfile(dir, save_filename, NULL)) {
> > > (dbx)
> > >
> > >
> > > I get the following output if I run the program on Solaris 10
> > > x86_64.
> > >
> > > ...
> > > RTC: Running program...
> > > Reading disasm.so
> > > Read from uninitialized (rui) on thread 1:
> > > Attempting to read 1 byte at address 0x437387
> > >which is 15 bytes into a heap block of size 16 bytes at 0x437378
> > > This block was allocated from:
> > >[1] vasprintf() at 0xfd7fdc9b335a
> > >[2] asprintf() at 0xfd7fdc9b3452
> > >[3] opal_output_init() at line 184 in "output.c"
> > >[4] do_open() at line 548 in "output.c"
> > >[5] opal_output_open() at line 219 in "output.c"
> > >[6] opal_malloc_init() at line 68 in "malloc.c"
> > >[7] opal_init_util() at line 258 in "opal_init.c"
> > >[8] opal_init() at line 363 in "opal_init.c"
> > >
> > > t@1 (l@1) stopped in do_open at line 638 in file "output.c"
> > >  638   info[i].ldi_prefix = strdup(lds->lds_prefix);
> > > (dbx)
> > >
> > >
> > > Hopefully the above output helps to fix the errors. Can I provide
> > > anything else? Thank you very much for any help in advance.
> > >
> > >
> > > Kind regards
> > >
> > > Siegmar
> > >
> > > ___
> > > users mailing list
> > > us...@open-mpi.org 
> > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > > Link to this post:
> > http://www.open-mpi.org/community/lists/users/2014/07/24870.php
> >
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com 
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> > ___
> > users mailing list
> > us...@open-mpi.org 
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post:
> > http://www.open-mpi.org/community/lists/users/2014/07/24874.php
> >
> >
> >
> >
> > -- 
> > Saliya Ekanayake esal...@gmail.com 
> > Cell 812-391-4914 Home 812-961-6383
> > http://saliya.org
> >
> >
> > ___
> > users mailing list
> > us...@open-mpi.org

Re: [OMPI users] Trying to use openmpi with MOM getting a compile error

2014-07-25 Thread Gus Correa

On 07/25/2014 03:02 PM, Jeff Squyres (jsquyres) wrote:

On Jul 25, 2014, at 1:14 PM, Gus Correa  wrote:


Change the mkmf.template file and replace the Fortran
compiler name (gfortran) by the Open MPI (OMPI) Fortran compiler wrapper: 
mpifortran (or mpif90 if it still exists
in OMPI 1.8.1),


mpifort is the preferred Fortran compiler wrapper name in the 1.8.x series.
mpif90 still exists, but we'll likely remove that name in some future release
(not before 1.9.x, of course).




Hi Jeff

Oops! Sorry for my misspelling of mpifort.
(Intel must love this name choice! :) )
Well, hopefully Dan Shell found the right compiler wrapper.

I haven't got beyond the ol' mpif90 and OMPI 1.6.5.
Just waiting for 1.8.2 to be out in the sun to update.

Gus Correa


Re: [OMPI users] Trying to use openmpi with MOM getting a compile error

2014-07-25 Thread Jeff Squyres (jsquyres)
On Jul 25, 2014, at 1:14 PM, Gus Correa  wrote:

> Change the mkmf.template file and replace the Fortran
> compiler name (gfortran) by the Open MPI (OMPI) Fortran compiler wrapper: 
> mpifortran (or mpif90 if it still exists
> in OMPI 1.8.1),

mpifort is the preferred Fortran compiler wrapper name in the 1.8.x series.  
mpif90 still exists, but we'll likely remove that name in some future release 
(not before 1.9.x, of course).

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] SIGSEGV for Java program in openmpi-1.8.2rc2 on Solaris 10

2014-07-25 Thread Oscar Vega-Gisbert

I'm sorry, but I cannot reproduce the problem.
I recompiled everything from scratch using Java 8, and it works OK on Debian 7.5.

Regards,
Oscar


On 25/07/14 18:28, Saliya Ekanayake wrote:
I too have encountered this, as mentioned in one of my previous emails 
(http://comments.gmane.org/gmane.comp.clustering.open-mpi.user/21000). 
I've run many tests of our algorithms with the 1.8.1 version, and it 
didn't have this problem, but I'm not sure about 1.8.2.


Thank you,
saliya


On Fri, Jul 25, 2014 at 11:56 AM, Jeff Squyres (jsquyres)
<jsquy...@cisco.com> wrote:


That's quite odd that it only happens for Java programs -- it
should happen for *all* programs, based on the stack trace you've
shown.

Can you print the value of the lds struct where the error occurs?


On Jul 25, 2014, at 2:29 AM, Siegmar Gross
<siegmar.gr...@informatik.hs-fulda.de> wrote:

> Hi,
>
> I have installed openmpi-1.8.2rc2 with Sun C 5.12 on Solaris
> 10 Sparc and x86_64, and I receive a segmentation fault if I
> run a small Java program.
>
> rs0 java 105 mpiexec -np 1 java InitFinalizeMain
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7ea3c830, pid=18363, tid=2
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0-b132) (build
1.8.0-b132)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b70 mixed
mode solaris-sparc
> compressed oops)
> # Problematic frame:
> # C  [libc.so.1+0x3c830]  strlen+0x50
> ...
>
>
> I get the following output if I run the program in "dbx".
>
> ...
> RTC: Running program...
> Write to unallocated (wua) on thread 1:
> Attempting to write 1 byte at address 0x79f04000
> t@1 (l@1) stopped in _readdir at 0x56574da0
> 0x56574da0: _readdir+0x0064:call
> _PROCEDURE_LINKAGE_TABLE_+0x2380 [PLT] ! 0x56742a80
> Current function is find_dyn_components
>  397           if (0 != lt_dlforeachfile(dir, save_filename, NULL)) {
> (dbx)
>
>
> I get the following output if I run the program on Solaris 10
> x86_64.
>
> ...
> RTC: Running program...
> Reading disasm.so
> Read from uninitialized (rui) on thread 1:
> Attempting to read 1 byte at address 0x437387
>which is 15 bytes into a heap block of size 16 bytes at 0x437378
> This block was allocated from:
>[1] vasprintf() at 0xfd7fdc9b335a
>[2] asprintf() at 0xfd7fdc9b3452
>[3] opal_output_init() at line 184 in "output.c"
>[4] do_open() at line 548 in "output.c"
>[5] opal_output_open() at line 219 in "output.c"
>[6] opal_malloc_init() at line 68 in "malloc.c"
>[7] opal_init_util() at line 258 in "opal_init.c"
>[8] opal_init() at line 363 in "opal_init.c"
>
> t@1 (l@1) stopped in do_open at line 638 in file "output.c"
>  638   info[i].ldi_prefix = strdup(lds->lds_prefix);
> (dbx)
>
>
> Hopefully the above output helps to fix the errors. Can I provide
> anything else? Thank you very much for any help in advance.
>
>
> Kind regards
>
> Siegmar
>
> ___
> users mailing list
> us...@open-mpi.org 
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
http://www.open-mpi.org/community/lists/users/2014/07/24870.php


--
Jeff Squyres
jsquy...@cisco.com 
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

___
users mailing list
us...@open-mpi.org 
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2014/07/24874.php




--
Saliya Ekanayake esal...@gmail.com 
Cell 812-391-4914 Home 812-961-6383
http://saliya.org


___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/07/24875.php




Re: [OMPI users] Trying to use openmpi with MOM getting a compile error

2014-07-25 Thread Dan Shell
Gus
Thank you, I'll give that a try.
Dan Shell


-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
Sent: Friday, July 25, 2014 1:15 PM
To: Open MPI Users
Subject: Re: [OMPI users] Trying to use openmpi with MOM getting a compile
error

Hi Dan

This is not an Open MPI problem.
This is most likely a problem with the MOM Makefile, which seems to be
missing your Open MPI include directory.

Change the mkmf.template file and replace the Fortran compiler name
(gfortran) by the Open MPI (OMPI) Fortran compiler
wrapper: mpifortran (or mpif90 if it still exists in OMPI 1.8.1), perhaps
using the full path to it.
The mpifortran/mpif90 compiler wrapper knows exactly where to find the Open
MPI include files, the libraries, etc.
You may need to comment out or remove spurious entries in mkmf.template
pointing to other MPI implementations (e.g.
to MPICH libraries and include files).

Then rebuild the Makefile and compile MOM again.

I hope this helps.
Gus Correa


On 07/25/2014 12:37 PM, Dan Shell wrote:
> OpenMOM-mpi
> I am trying to compile MOM and have installed openmpi 1.8.1; I am getting 
> the installation error below. I am looking for some help with openmpi to 
> make sure mpif.h is loaded correctly. Should I use an older version of 
> openmpi?
> Thank You
> Dan Shell
>
> gfortran  -Duse_netCDF -Duse_netCDF3 -Duse_libMPI -DUSE_OCEAN_BGC
> -DENABLE_ODA -DSPMD -DLAND_BND_TRACERS -c
> -I/root/Desktop/NEW_MOM/newmom/src/shared/include
> -I/root/Desktop/NEW_MOM/newmom/src/shared/mpp/include
> /root/Desktop/NEW_MOM/newmom/src/shared/mpp/mpp_data.F90
> /root/Desktop/NEW_MOM/newmom/src/shared/mpp/mpp_data.F90:23:
>
> include 
> 1
> Error: Unclassifiable statement at (1)
> /root/Desktop/NEW_MOM/newmom/src/shared/mpp/include/mpp_data_mpi.inc:8.31:
>  Included at
> /root/Desktop/NEW_MOM/newmom/src/shared/mpp/mpp_data.F90:45:
>
> integer :: stat(MPI_STATUS_SIZE)
> 1
> Error: Symbol 'mpi_status_size' at (1) has no IMPLICIT type
> /root/Desktop/NEW_MOM/newmom/src/shared/mpp/mpp_data.F90:28.16:
>
>public :: stat, mpp_stack, ptr_stack, status, ptr_status, sync,
ptr_sync
>  1
> Error: The module or main program array 'stat' at (1) must have 
> constant shape
> make: *** [mpp_data.o] Error 1
> if ( 2 ) then
> echo Make failed to create  lib_FMS.a
> Make failed to create  lib_FMS.a
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/07/24876.php
>

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2014/07/24877.php



Re: [OMPI users] Trying to use openmpi with MOM getting a compile error

2014-07-25 Thread Gus Correa

Hi Dan

This is not an Open MPI problem.
This is most likely a problem with the MOM Makefile,
which seems to be missing your Open MPI include directory.

Change the mkmf.template file and replace the Fortran
compiler name (gfortran) by the Open MPI (OMPI) Fortran compiler 
wrapper: mpifortran (or mpif90 if it still exists
in OMPI 1.8.1), perhaps using the full path to it.
The mpifortran/mpif90 compiler wrapper knows exactly where to find
the Open MPI include files, the libraries, etc.
You may need to comment out or remove spurious entries
in mkmf.template pointing to other MPI implementations (e.g.
to MPICH libraries and include files).

Then rebuild the Makefile and compile MOM again.

I hope this helps.
Gus Correa
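
For illustration, a minimal sketch of the relevant mkmf.template lines after
that change, using the mpifort name that Jeff confirms elsewhere in this
thread (the variable names FC/LD and the install prefix
/usr/local/openmpi-1.8.1 are assumptions; adjust to your template and your
installation):

FC = /usr/local/openmpi-1.8.1/bin/mpifort
LD = /usr/local/openmpi-1.8.1/bin/mpifort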


On 07/25/2014 12:37 PM, Dan Shell wrote:

OpenMOM-mpi
I am trying to compile MOM and have installed openmpi 1.8.1; I am getting the
installation error below.
I am looking for some help with openmpi to make sure mpif.h is loaded correctly.
Should I use an older version of openmpi?
Thank You
Dan Shell

gfortran  -Duse_netCDF -Duse_netCDF3 -Duse_libMPI -DUSE_OCEAN_BGC
-DENABLE_ODA -DSPMD -DLAND_BND_TRACERS -c
-I/root/Desktop/NEW_MOM/newmom/src/shared/include
-I/root/Desktop/NEW_MOM/newmom/src/shared/mpp/include
/root/Desktop/NEW_MOM/newmom/src/shared/mpp/mpp_data.F90
/root/Desktop/NEW_MOM/newmom/src/shared/mpp/mpp_data.F90:23:

include 
1
Error: Unclassifiable statement at (1)
/root/Desktop/NEW_MOM/newmom/src/shared/mpp/include/mpp_data_mpi.inc:8.31:
 Included at
/root/Desktop/NEW_MOM/newmom/src/shared/mpp/mpp_data.F90:45:

integer :: stat(MPI_STATUS_SIZE)
1
Error: Symbol 'mpi_status_size' at (1) has no IMPLICIT type
/root/Desktop/NEW_MOM/newmom/src/shared/mpp/mpp_data.F90:28.16:

   public :: stat, mpp_stack, ptr_stack, status, ptr_status, sync, ptr_sync
 1
Error: The module or main program array 'stat' at (1) must have constant
shape
make: *** [mpp_data.o] Error 1
if ( 2 ) then
echo Make failed to create  lib_FMS.a
Make failed to create  lib_FMS.a



___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/07/24876.php





[OMPI users] Trying to use openmpi with MOM getting a compile error

2014-07-25 Thread Dan Shell
OpenMOM-mpi
I am trying to compile MOM and have installed openmpi 1.8.1; I am getting the
installation error below.
I am looking for some help with openmpi to make sure mpif.h is loaded correctly.
Should I use an older version of openmpi?
Thank You
Dan Shell

gfortran  -Duse_netCDF -Duse_netCDF3 -Duse_libMPI -DUSE_OCEAN_BGC
-DENABLE_ODA -DSPMD -DLAND_BND_TRACERS -c
-I/root/Desktop/NEW_MOM/newmom/src/shared/include
-I/root/Desktop/NEW_MOM/newmom/src/shared/mpp/include
/root/Desktop/NEW_MOM/newmom/src/shared/mpp/mpp_data.F90
/root/Desktop/NEW_MOM/newmom/src/shared/mpp/mpp_data.F90:23:

include 
1
Error: Unclassifiable statement at (1)
/root/Desktop/NEW_MOM/newmom/src/shared/mpp/include/mpp_data_mpi.inc:8.31:
Included at /root/Desktop/NEW_MOM/newmom/src/shared/mpp/mpp_data.F90:45:

integer :: stat(MPI_STATUS_SIZE)
   1
Error: Symbol 'mpi_status_size' at (1) has no IMPLICIT type
/root/Desktop/NEW_MOM/newmom/src/shared/mpp/mpp_data.F90:28.16:

  public :: stat, mpp_stack, ptr_stack, status, ptr_status, sync, ptr_sync
1
Error: The module or main program array 'stat' at (1) must have constant
shape
make: *** [mpp_data.o] Error 1
if ( 2 ) then
echo Make failed to create  lib_FMS.a
Make failed to create  lib_FMS.a


Re: [OMPI users] SIGSEGV for Java program in openmpi-1.8.2rc2 on Solaris 10

2014-07-25 Thread Saliya Ekanayake
I too have encountered this, as mentioned in one of my previous emails (
http://comments.gmane.org/gmane.comp.clustering.open-mpi.user/21000). I've
run many tests of our algorithms with the 1.8.1 version, and it didn't have
this problem, but I'm not sure about 1.8.2.

Thank you,
saliya


On Fri, Jul 25, 2014 at 11:56 AM, Jeff Squyres (jsquyres) <
jsquy...@cisco.com> wrote:

> That's quite odd that it only happens for Java programs -- it should
> happen for *all* programs, based on the stack trace you've shown.
>
> Can you print the value of the lds struct where the error occurs?
>
>
> On Jul 25, 2014, at 2:29 AM, Siegmar Gross <
> siegmar.gr...@informatik.hs-fulda.de> wrote:
>
> > Hi,
> >
> > I have installed openmpi-1.8.2rc2 with Sun C 5.12 on Solaris
> > 10 Sparc and x86_64, and I receive a segmentation fault if I
> > run a small Java program.
> >
> > rs0 java 105 mpiexec -np 1 java InitFinalizeMain
> > #
> > # A fatal error has been detected by the Java Runtime Environment:
> > #
> > #  SIGSEGV (0xb) at pc=0x7ea3c830, pid=18363, tid=2
> > #
> > # JRE version: Java(TM) SE Runtime Environment (8.0-b132) (build
> 1.8.0-b132)
> > # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b70 mixed mode
> solaris-sparc
> > compressed oops)
> > # Problematic frame:
> > # C  [libc.so.1+0x3c830]  strlen+0x50
> > ...
> >
> >
> > I get the following output if I run the program in "dbx".
> >
> > ...
> > RTC: Running program...
> > Write to unallocated (wua) on thread 1:
> > Attempting to write 1 byte at address 0x79f04000
> > t@1 (l@1) stopped in _readdir at 0x56574da0
> > 0x56574da0: _readdir+0x0064:call
> > _PROCEDURE_LINKAGE_TABLE_+0x2380 [PLT] ! 0x56742a80
> > Current function is find_dyn_components
> >  397           if (0 != lt_dlforeachfile(dir, save_filename, NULL)) {
> > (dbx)
> >
> >
> > I get the following output if I run the program on Solaris 10
> > x86_64.
> >
> > ...
> > RTC: Running program...
> > Reading disasm.so
> > Read from uninitialized (rui) on thread 1:
> > Attempting to read 1 byte at address 0x437387
> >which is 15 bytes into a heap block of size 16 bytes at 0x437378
> > This block was allocated from:
> >[1] vasprintf() at 0xfd7fdc9b335a
> >[2] asprintf() at 0xfd7fdc9b3452
> >[3] opal_output_init() at line 184 in "output.c"
> >[4] do_open() at line 548 in "output.c"
> >[5] opal_output_open() at line 219 in "output.c"
> >[6] opal_malloc_init() at line 68 in "malloc.c"
> >[7] opal_init_util() at line 258 in "opal_init.c"
> >[8] opal_init() at line 363 in "opal_init.c"
> >
> > t@1 (l@1) stopped in do_open at line 638 in file "output.c"
> >  638   info[i].ldi_prefix = strdup(lds->lds_prefix);
> > (dbx)
> >
> >
> > Hopefully the above output helps to fix the errors. Can I provide
> > anything else? Thank you very much for any help in advance.
> >
> >
> > Kind regards
> >
> > Siegmar
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/07/24870.php
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/07/24874.php
>



-- 
Saliya Ekanayake esal...@gmail.com
Cell 812-391-4914 Home 812-961-6383
http://saliya.org


Re: [OMPI users] SIGSEGV for Java program in openmpi-1.8.2rc2 on Solaris 10

2014-07-25 Thread Jeff Squyres (jsquyres)
That's quite odd that it only happens for Java programs -- it should happen for 
*all* programs, based on the stack trace you've shown.

Can you print the value of the lds struct where the error occurs?
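
For reference, one way to do that from the dbx prompt at the stop shown
below might be (dbx syntax hedged; lds is the local visible in the do_open
frame):

(dbx) print *lds
(dbx) print lds->lds_prefix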


On Jul 25, 2014, at 2:29 AM, Siegmar Gross
<siegmar.gr...@informatik.hs-fulda.de> wrote:

> Hi,
> 
> I have installed openmpi-1.8.2rc2 with Sun C 5.12 on Solaris
> 10 Sparc and x86_64, and I receive a segmentation fault if I
> run a small Java program.
> 
> rs0 java 105 mpiexec -np 1 java InitFinalizeMain
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7ea3c830, pid=18363, tid=2
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0-b132) (build 1.8.0-b132)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b70 mixed mode 
> solaris-sparc 
> compressed oops)
> # Problematic frame:
> # C  [libc.so.1+0x3c830]  strlen+0x50
> ...
> 
> 
> I get the following output if I run the program in "dbx".
> 
> ...
> RTC: Running program...
> Write to unallocated (wua) on thread 1:
> Attempting to write 1 byte at address 0x79f04000
> t@1 (l@1) stopped in _readdir at 0x56574da0
> 0x56574da0: _readdir+0x0064:call 
> _PROCEDURE_LINKAGE_TABLE_+0x2380 [PLT] ! 0x56742a80
> Current function is find_dyn_components
>  397           if (0 != lt_dlforeachfile(dir, save_filename, NULL)) {
> (dbx) 
> 
> 
> I get the following output if I run the program on Solaris 10
> x86_64.
> 
> ...
> RTC: Running program...
> Reading disasm.so
> Read from uninitialized (rui) on thread 1:
> Attempting to read 1 byte at address 0x437387
>which is 15 bytes into a heap block of size 16 bytes at 0x437378
> This block was allocated from:
>[1] vasprintf() at 0xfd7fdc9b335a 
>[2] asprintf() at 0xfd7fdc9b3452 
>[3] opal_output_init() at line 184 in "output.c"
>[4] do_open() at line 548 in "output.c"
>[5] opal_output_open() at line 219 in "output.c"
>[6] opal_malloc_init() at line 68 in "malloc.c"
>[7] opal_init_util() at line 258 in "opal_init.c"
>[8] opal_init() at line 363 in "opal_init.c"
> 
> t@1 (l@1) stopped in do_open at line 638 in file "output.c"
>  638   info[i].ldi_prefix = strdup(lds->lds_prefix);
> (dbx) 
> 
> 
> Hopefully the above output helps to fix the errors. Can I provide
> anything else? Thank you very much for any help in advance.
> 
> 
> Kind regards
> 
> Siegmar
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/07/24870.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-07-25 Thread Jeff Squyres (jsquyres)
Siegmar --

This looks like the typical type of alignment error that we used to see when 
testing regularly on SPARC.  :-\

It looks like the error was happening in mca_db_hash.so.  Could you get a stack 
trace / file+line number where it was failing in mca_db_hash?  (i.e., the 
actual bad code will likely be under opal/mca/db/hash somewhere)
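
For context, a sketch in C of the usual shape of this bug and its portable
fix (illustrative only; this is not the actual mca_db_hash code):

#include <stdint.h>
#include <string.h>

void store_u64 (char *buf, uint64_t v)
{
  /* *(uint64_t *)(buf + 4) = v;  -- would SIGBUS on SPARC, because
     buf + 4 need not be 8-byte aligned and SPARC traps misaligned
     stores */
  memcpy (buf + 4, &v, sizeof (v));  /* safe: memcpy has no alignment
                                        requirement */
}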


On Jul 25, 2014, at 2:08 AM, Siegmar Gross
<siegmar.gr...@informatik.hs-fulda.de> wrote:

> Hi,
> 
> I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris
> 10 Sparc, and I receive a bus error if I run a small program.
> 
> tyr hello_1 105 mpiexec -np 2 a.out 
> [tyr:29164] *** Process received signal ***
> [tyr:29164] Signal: Bus Error (10)
> [tyr:29164] Signal code: Invalid address alignment (1)
> [tyr:29164] Failing at address: 7fffd1c4
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_backtrace_print+0x2c
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:0xccfd0
> /lib/sparcv9/libc.so.1:0xd8b98
> /lib/sparcv9/libc.so.1:0xcc70c
> /lib/sparcv9/libc.so.1:0xcc918
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_db_hash.so:0x3ee8
>  [ Signal 10 (BUS)]
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_db_base_store+0xc8
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_decode_pidmap+0x798
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_nidmap_init+0x3cc
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_ess_env.so:0x226c
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_init+0x308
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:ompi_mpi_init+0x31c
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:PMPI_Init+0x2a8
> /home/fd1026/work/skripte/master/parallel/prog/mpi/hello_1/a.out:main+0x20
> /home/fd1026/work/skripte/master/parallel/prog/mpi/hello_1/a.out:_start+0x7c
> [tyr:29164] *** End of error message ***
> ...
> 
> 
> I get the following output if I run the program in "dbx".
> 
> ...
> RTC: Enabling Error Checking...
> RTC: Running program...
> Write to unallocated (wua) on thread 1:
> Attempting to write 1 byte at address 0x79f04000
> t@1 (l@1) stopped in _readdir at 0x55174da0
> 0x55174da0: _readdir+0x0064:call 
> _PROCEDURE_LINKAGE_TABLE_+0x2380 [PLT] ! 0x55342a80
> (dbx) 
> 
> 
> Hopefully the above output helps to fix the error. Can I provide
> anything else? Thank you very much for any help in advance.
> 
> 
> Kind regards
> 
> Siegmar
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/07/24869.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] problem with mca_pml_ob1.so in openmpi-1.8.2rc2

2014-07-25 Thread Nathan Hjelm

Can you try adding the

#include <alloca.h>

to pml_ob1_isend.c

And see if that resolves the issue?
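
For reference, a bare include may not cover every compiler; a hedged sketch
of the portable idiom from the autoconf manual, in case it is needed (the
actual fix in the tree may differ):

#ifdef HAVE_ALLOCA_H
# include <alloca.h>
#elif defined __GNUC__
# define alloca __builtin_alloca
#elif defined _MSC_VER
# include <malloc.h>
# define alloca _alloca
#else
# include <stddef.h>
void *alloca (size_t);
#endif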

-Nathan

On Fri, Jul 25, 2014 at 07:59:21AM +0200, Siegmar Gross wrote:
> Hi,
> 
> today I tried to track down the error which I reported for
> my small program (running on Solaris 10 Sparc).
> 
> tyr hello_1 121 mpiexec -np 2 a.out 
> Process 1 of 2 running on tyr.informatik.hs-fulda.de
> Process 0 of 2 running on tyr.informatik.hs-fulda.de
> Now 1 slave tasks are sending greetings.
> ld.so.1: a.out: fatal: relocation error:
>   file /usr/local/openmpi-1.8.2_64_cc/lib64/openmpi/mca_pml_ob1.so:
>   symbol alloca: referenced symbol not found
> ...
> 
> 
> "alloca" is available.
> 
> tyr hello_1 122 more x.c 
> #include <alloca.h>
> #include <stdio.h>
> #include <stdlib.h>
> 
> int main (void)
> {
>   int *alloca_buffer;
>   alloca_buffer = (int *) alloca (sizeof (int));
>   *alloca_buffer = 1234;
>   printf ("value: %d\n", *alloca_buffer);
>   return EXIT_SUCCESS;
> }
> tyr hello_1 123 cc x.c 
> tyr hello_1 124 a.out
> value: 1234
> tyr hello_1 125 
> 
> 
> I get the following output if I run my original program in "dbx".
> 
> ...
> RTC: Running program...
> Write to unallocated (wua) on thread 1:
> Attempting to write 1 byte at address 0x79f04000
> t@1 (l@1) stopped in _readdir at 0x56574da0
> 0x56574da0: _readdir+0x0064:call 
> _PROCEDURE_LINKAGE_TABLE_+0x2380 [PLT] ! 0x56742a80
> Current function is find_dyn_components
>   397           if (0 != lt_dlforeachfile(dir, save_filename, NULL)) {
> (dbx) 
> 
> 
> Hopefully the above output helps to fix the error. Can I provide
> anything else? Thank you very much for any help in advance.
> 
> 
> Kind regards
> 
> Siegmar
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/07/24868.php




[OMPI users] missing link option for openmpi-1.8.2rc2 on Linux

2014-07-25 Thread Siegmar Gross
Hi,

I installed openmpi-1.8.2rc2 on "openSUSE Linux 12.1 x86_64"
with Sun C 5.12 and get the following warning if I run a small
Java program. I get no warning for my gcc-4.9.0 version of
openmpi-1.8.2rc2.


linpc1 java 109 mpiexec -np 1 java InitFinalizeMain
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library
  /usr/local/openmpi-1.8.2_64_cc/lib64/libmpi_java.so.1.2.0 which
  might have disabled stack guard. The VM will try to fix the stack
  guard now.
It's highly recommended that you fix the library with
  'execstack -c <libfile>', or link it with '-z noexecstack'.
Hello!
linpc1 java 110 


I would be grateful if somebody could add the link option
'-z noexecstack' to avoid the warning. Thank you very much for
your help in advance.
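
In the meantime, a possible workaround (untested) is to clear the flag on
the installed library with the command the JVM itself suggests:

execstack -c /usr/local/openmpi-1.8.2_64_cc/lib64/libmpi_java.so.1.2.0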


Kind regards

Siegmar



[OMPI users] SIGSEGV for Java program in openmpi-1.8.2rc2 on Solaris 10

2014-07-25 Thread Siegmar Gross
Hi,

I have installed openmpi-1.8.2rc2 with Sun C 5.12 on Solaris
10 Sparc and x86_64, and I receive a segmentation fault if I
run a small Java program.

rs0 java 105 mpiexec -np 1 java InitFinalizeMain
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x7ea3c830, pid=18363, tid=2
#
# JRE version: Java(TM) SE Runtime Environment (8.0-b132) (build 1.8.0-b132)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b70 mixed mode solaris-sparc 
compressed oops)
# Problematic frame:
# C  [libc.so.1+0x3c830]  strlen+0x50
...


I get the following output if I run the program in "dbx".

...
RTC: Running program...
Write to unallocated (wua) on thread 1:
Attempting to write 1 byte at address 0x79f04000
t@1 (l@1) stopped in _readdir at 0x56574da0
0x56574da0: _readdir+0x0064:call 
_PROCEDURE_LINKAGE_TABLE_+0x2380 [PLT] ! 0x56742a80
Current function is find_dyn_components
  397           if (0 != lt_dlforeachfile(dir, save_filename, NULL)) {
(dbx) 


I get the following output if I run the program on Solaris 10
x86_64.

...
RTC: Running program...
Reading disasm.so
Read from uninitialized (rui) on thread 1:
Attempting to read 1 byte at address 0x437387
which is 15 bytes into a heap block of size 16 bytes at 0x437378
This block was allocated from:
[1] vasprintf() at 0xfd7fdc9b335a 
[2] asprintf() at 0xfd7fdc9b3452 
[3] opal_output_init() at line 184 in "output.c"
[4] do_open() at line 548 in "output.c"
[5] opal_output_open() at line 219 in "output.c"
[6] opal_malloc_init() at line 68 in "malloc.c"
[7] opal_init_util() at line 258 in "opal_init.c"
[8] opal_init() at line 363 in "opal_init.c"

t@1 (l@1) stopped in do_open at line 638 in file "output.c"
  638   info[i].ldi_prefix = strdup(lds->lds_prefix);
(dbx) 


Hopefully the above output helps to fix the errors. Can I provide
anything else? Thank you very much for any help in advance.
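
For what it's worth, a minimal defensive sketch for the strdup() call
flagged above, assuming lds->lds_prefix can legitimately be NULL at that
point (names are taken from the dbx trace; this is not necessarily the
right fix, since the rui report may instead mean the string is not
NUL-terminated within its 16-byte allocation):

info[i].ldi_prefix = (NULL != lds->lds_prefix)
                   ? strdup (lds->lds_prefix) : NULL;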


Kind regards

Siegmar



[OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-07-25 Thread Siegmar Gross
Hi,

I have installed openmpi-1.8.2rc2 with gcc-4.9.0 on Solaris
10 Sparc, and I receive a bus error if I run a small program.

tyr hello_1 105 mpiexec -np 2 a.out 
[tyr:29164] *** Process received signal ***
[tyr:29164] Signal: Bus Error (10)
[tyr:29164] Signal code: Invalid address alignment (1)
[tyr:29164] Failing at address: 7fffd1c4
/export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_backtrace_print+0x2c
/export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:0xccfd0
/lib/sparcv9/libc.so.1:0xd8b98
/lib/sparcv9/libc.so.1:0xcc70c
/lib/sparcv9/libc.so.1:0xcc918
/export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_db_hash.so:0x3ee8
 [ Signal 10 (BUS)]
/export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_db_base_store+0xc8
/export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_decode_pidmap+0x798
/export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_nidmap_init+0x3cc
/export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_ess_env.so:0x226c
/export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_init+0x308
/export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:ompi_mpi_init+0x31c
/export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:PMPI_Init+0x2a8
/home/fd1026/work/skripte/master/parallel/prog/mpi/hello_1/a.out:main+0x20
/home/fd1026/work/skripte/master/parallel/prog/mpi/hello_1/a.out:_start+0x7c
[tyr:29164] *** End of error message ***
...


I get the following output if I run the program in "dbx".

...
RTC: Enabling Error Checking...
RTC: Running program...
Write to unallocated (wua) on thread 1:
Attempting to write 1 byte at address 0x79f04000
t@1 (l@1) stopped in _readdir at 0x55174da0
0x55174da0: _readdir+0x0064:call 
_PROCEDURE_LINKAGE_TABLE_+0x2380 [PLT] ! 0x55342a80
(dbx) 


Hopefully the above output helps to fix the error. Can I provide
anything else? Thank you very much for any help in advance.


Kind regards

Siegmar



[OMPI users] problem with mca_pml_ob1.so in openmpi-1.8.2rc2

2014-07-25 Thread Siegmar Gross
Hi,

today I tried to track down the error which I reported for
my small program (running on Solaris 10 Sparc).

tyr hello_1 121 mpiexec -np 2 a.out 
Process 1 of 2 running on tyr.informatik.hs-fulda.de
Process 0 of 2 running on tyr.informatik.hs-fulda.de
Now 1 slave tasks are sending greetings.
ld.so.1: a.out: fatal: relocation error:
  file /usr/local/openmpi-1.8.2_64_cc/lib64/openmpi/mca_pml_ob1.so:
  symbol alloca: referenced symbol not found
...


"alloca" is available.

tyr hello_1 122 more x.c 
#include <alloca.h>
#include <stdio.h>
#include <stdlib.h>

int main (void)
{
  int *alloca_buffer;
  alloca_buffer = (int *) alloca (sizeof (int));
  *alloca_buffer = 1234;
  printf ("value: %d\n", *alloca_buffer);
  return EXIT_SUCCESS;
}
tyr hello_1 123 cc x.c 
tyr hello_1 124 a.out
value: 1234
tyr hello_1 125 


I get the following output if I run my original program in "dbx".

...
RTC: Running program...
Write to unallocated (wua) on thread 1:
Attempting to write 1 byte at address 0x79f04000
t@1 (l@1) stopped in _readdir at 0x56574da0
0x56574da0: _readdir+0x0064:call 
_PROCEDURE_LINKAGE_TABLE_+0x2380 [PLT] ! 0x56742a80
Current function is find_dyn_components
  397           if (0 != lt_dlforeachfile(dir, save_filename, NULL)) {
(dbx) 


Hopefully the above output helps to fix the error. Can I provide
anything else? Thank you very much for any help in advance.


Kind regards

Siegmar