Re: [OMPI users] Bug in OpenMPI-1.8.3: storage limitation in shared memory allocation (MPI_WIN_ALLOCATE_SHARED) in Ftn-code

2014-10-24 Thread Jeff Squyres (jsquyres)
Nathan tells me that this may well be related to a fix that was literally just 
pulled into the v1.8 branch today:

https://github.com/open-mpi/ompi-release/pull/56

Would you mind testing any nightly tarball after tonight?  (i.e., the v1.8 
tarballs generated tonight will be the first ones to contain this fix)

http://www.open-mpi.org/nightly/master/



On Oct 24, 2014, at 11:46 AM, Michael Rachner wrote:

> Dear developers of OPENMPI,
>  
> I am running a small, downsized Fortran test program for shared memory 
> allocation (using MPI_WIN_ALLOCATE_SHARED and MPI_WIN_SHARED_QUERY)
> on only 1 node of 2 different Linux clusters with OpenMPI-1.8.3 and 
> Intel-14.0.4 / Intel-13.0.1, respectively.
>  
> The program simply allocates a sequence of shared data windows, each 
> consisting of 1 integer*4 array.
> None of the windows is freed, so the amount of data allocated in shared 
> windows grows during the course of the execution.
>  
> That worked well on the 1st cluster (Laki, having 8 procs per node) when 
> allocating even 1000 shared windows, each having 50000 integer*4 array 
> elements,
> i.e. a total of  200 MBytes.
> On the 2nd cluster (Cluster5, having 24 procs per node) it also worked on the 
> login node, but it did NOT work on a compute node.
> In that error case, something like an internal storage limit of 
> ~140 MB applies to the total storage allocated in all shared windows.
> When that limit is reached, all later shared memory allocations fail (but 
> silently).
> So the first attempt to use such a bad shared data window results in a bus 
> error due to the bad storage address encountered.
>  
> That strange behavior could be observed in the small testprogram but also 
> with my large Fortran CFD-code.
> If the error occurs, then it occurs with both codes, and both at a storage 
> limit of  ~140 MB.
> I found that this storage limit depends only weakly on  the number of 
> processes (for np=2,4,8,16,24  it is: 144.4 , 144.0, 141.0, 137.0, 132.2 MB)
>  
> Note that the shared memory storage available on both clusters was very large 
> (many GB of free memory).
>  
> Here is the error message when running with np=2 and an array dimension of 
> idim_1=50000 for the integer*4 array allocated per shared window
> on the compute node of Cluster5:
> In that case, the error occurred at the 723rd shared window, which is the 
> 1st badly allocated window in that case:
> (722 successfully allocated shared windows * 50000 array elements * 4 
> Bytes/el. = 144.4 MB)
>  
>  
> [1,0]: on nodemaster: iwin= 722 :
> [1,0]:  total storage [MByte] alloc. in shared windows so far:   
> 144.4000
> [1,0]: === allocation of shared window no. iwin= 723
> [1,0]:  starting now with idim_1=   50000
> [1,0]: on nodemaster for iwin= 723 : before writing 
> on shared mem
> [1,0]:[r5i5n13:12597] *** Process received signal ***
> [1,0]:[r5i5n13:12597] Signal: Bus error (7)
> [1,0]:[r5i5n13:12597] Signal code: Non-existant physical address (2)
> [1,0]:[r5i5n13:12597] Failing at address: 0x7fffe08da000
> [1,0]:[r5i5n13:12597] [ 0] 
> [1,0]:/lib64/libpthread.so.0(+0xf800)[0x76d67800]
> [1,0]:[r5i5n13:12597] [ 1] ./a.out[0x408a8b]
> [1,0]:[r5i5n13:12597] [ 2] ./a.out[0x40800c]
> [1,0]:[r5i5n13:12597] [ 3] 
> [1,0]:/lib64/libc.so.6(__libc_start_main+0xe6)[0x769fec36]
> [1,0]:[r5i5n13:12597] [ 4] [1,0]:./a.out[0x407f09]
> [1,0]:[r5i5n13:12597] *** End of error message ***
> [1,1]:forrtl: error (78): process killed (SIGTERM)
> [1,1]:Image              PC                Routine            Line        Source
> [1,1]:libopen-pal.so.6   74B74580  Unknown   
> Unknown  Unknown
> [1,1]:libmpi.so.177267F3E  Unknown   
> Unknown  Unknown
> [1,1]:libmpi.so.17733B555  Unknown   
> Unknown  Unknown
> [1,1]:libmpi.so.17727DFFD  Unknown   
> Unknown  Unknown
> [1,1]:libmpi_mpifh.so.2  7779BA03  Unknown   
> Unknown  Unknown
> [1,1]:a.out  00408D15  Unknown   
> Unknown  Unknown
> [1,1]:a.out  0040800C  Unknown   
> Unknown  Unknown
> [1,1]:libc.so.6  769FEC36  Unknown   
> Unknown  Unknown
> [1,1]:a.out  00407F09  Unknown   
> Unknown  Unknown
> --
> mpiexec noticed that process rank 0 with PID 12597 on node r5i5n13 exited on 
> signal 7 (Bus error).
> --
>  
>  
> The small Ftn-testprogram was built and run with:
>   mpif90 sharedmemtest.f90
>   mpiexec -np 2 -bind-to core -tag-output ./a.out
>  
> Why does it work on the Laki (both on the login node and on a compute node) as 
> well as on the login node of Cluster5,
> but fail on a compute node of Cluster5?

Re: [OMPI users] Problem with Yosemite

2014-10-24 Thread Ralph Castain
Found that you do have to use the Apple version of libtool to build, however - 
the Darwin ports “glibtool” version will fail.

Tested the 1.8.3 tarball and it again worked fine.


> On Oct 24, 2014, at 10:46 AM, Ralph Castain  wrote:
> 
> Will do - just taking forever to update my Darwin ports so I can try with 
> their libtool version
> 
> Guillaume - did you remember to update your ports before building? Major 
> changes to support Yosemite.
> 
> 
>> On Oct 24, 2014, at 10:01 AM, Jeff Squyres (jsquyres)  
>> wrote:
>> 
>> Ralph --
>> 
>> Can you try a 1.8 nightly tarball build on Y?
>> 
>> 
>> 
>> On Oct 24, 2014, at 12:32 PM, Ralph Castain  wrote:
>> 
>>> Could well be - I’m using the libtool from Apple
>>> 
>>> Apple Inc. version cctools-855
>>> 
>>> Just verified that 1.8 is working fine as well.
>>> Ralph
>>> 
>>> 
 On Oct 24, 2014, at 9:23 AM, Bert Wesarg  
 wrote:
 
 This is maybe related to a problem in libtool:
 
 http://lists.gnu.org/archive/html/libtool-patches/2014-09/msg2.html
 
 On Fri, Oct 24, 2014 at 6:09 PM, Ralph Castain  wrote:
> I was able to build and run the trunk without problem on Yosemite with:
> 
> gcc (MacPorts gcc49 4.9.1_0) 4.9.1
> GNU Fortran (MacPorts gcc49 4.9.1_0) 4.9.1
> 
> Will test 1.8 branch now, though I believe the fortran support in 1.8 is
> up-to-date
> 
> 
> On Oct 24, 2014, at 6:46 AM, Guillaume Houzeaux 
> 
> wrote:
> 
> Good morning/afternoon/night,
> 
> I actualized my OS two days ago from Maverick to Yosemite (Xcode6.1).
> I recompiled openmpi-1.4.1, openmpi-1.6.3, openmpi-1.8.3 and get the same
> problem.
> 
> What I did:
> 
> 1. mpif90 test_openmpi.f90 -o test_openmpi.x
> 2. ./test_openmpi.x
> 
> and get
> 
> BEFORE
> [bsccs456.int.bsc.es:81492] [[INVALID],INVALID] ORTE_ERROR_LOG: Unknown
> error: -1 in file runtime/orte_globals.c at line 173
> input in flex scanner failed
> 
> 
> Thanks in advance,
> 
> Guillaume
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/10/25570.php
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/10/25573.php
 ___
 users mailing list
 us...@open-mpi.org
 Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
 Link to this post: 
 http://www.open-mpi.org/community/lists/users/2014/10/25574.php
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org

Re: [OMPI users] Problem with Yosemite

2014-10-24 Thread Ralph Castain
Will do - just taking forever to update my Darwin ports so I can try with their 
libtool version

Guillaume - did you remember to update your ports before building? Major 
changes to support Yosemite.


> On Oct 24, 2014, at 10:01 AM, Jeff Squyres (jsquyres)  
> wrote:
> 
> Ralph --
> 
> Can you try a 1.8 nightly tarball build on Y?
> 
> 
> 
> On Oct 24, 2014, at 12:32 PM, Ralph Castain  wrote:
> 
>> Could well be - I’m using the libtool from Apple
>> 
>> Apple Inc. version cctools-855
>> 
>> Just verified that 1.8 is working fine as well.
>> Ralph
>> 
>> 
>>> On Oct 24, 2014, at 9:23 AM, Bert Wesarg  wrote:
>>> 
>>> This is maybe related to a problem in libtool:
>>> 
>>> http://lists.gnu.org/archive/html/libtool-patches/2014-09/msg2.html
>>> 
>>> On Fri, Oct 24, 2014 at 6:09 PM, Ralph Castain  wrote:
 I was able to build and run the trunk without problem on Yosemite with:
 
 gcc (MacPorts gcc49 4.9.1_0) 4.9.1
 GNU Fortran (MacPorts gcc49 4.9.1_0) 4.9.1
 
 Will test 1.8 branch now, though I believe the fortran support in 1.8 is
 up-to-date
 
 
 On Oct 24, 2014, at 6:46 AM, Guillaume Houzeaux 
 wrote:
 
 Good morning/afternoon/night,
 
 I actualized my OS two days ago from Maverick to Yosemite (Xcode6.1).
 I recompiled openmpi-1.4.1, openmpi-1.6.3, openmpi-1.8.3 and get the same
 problem.
 
 What I did:
 
 1. mpif90 test_openmpi.f90 -o test_openmpi.x
 2. ./test_openmpi.x
 
 and get
 
 BEFORE
 [bsccs456.int.bsc.es:81492] [[INVALID],INVALID] ORTE_ERROR_LOG: Unknown
 error: -1 in file runtime/orte_globals.c at line 173
 input in flex scanner failed
 
 
 Thanks in advance,
 
 Guillaume
 
 ___
 users mailing list
 us...@open-mpi.org
 Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
 Link to this post:
 http://www.open-mpi.org/community/lists/users/2014/10/25570.php
 
 
 
 ___
 users mailing list
 us...@open-mpi.org
 Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
 Link to this post:
 http://www.open-mpi.org/community/lists/users/2014/10/25573.php
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/users/2014/10/25574.php
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2014/10/25576.php
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: 

Re: [OMPI users] Problem with Yosemite

2014-10-24 Thread Jeff Squyres (jsquyres)
Ralph --

Can you try a 1.8 nightly tarball build on Y?



On Oct 24, 2014, at 12:32 PM, Ralph Castain  wrote:

> Could well be - I’m using the libtool from Apple
> 
>  Apple Inc. version cctools-855
> 
> Just verified that 1.8 is working fine as well.
> Ralph
> 
> 
>> On Oct 24, 2014, at 9:23 AM, Bert Wesarg  wrote:
>> 
>> This is maybe related to a problem in libtool:
>> 
>> http://lists.gnu.org/archive/html/libtool-patches/2014-09/msg2.html
>> 
>> On Fri, Oct 24, 2014 at 6:09 PM, Ralph Castain  wrote:
>>> I was able to build and run the trunk without problem on Yosemite with:
>>> 
>>> gcc (MacPorts gcc49 4.9.1_0) 4.9.1
>>> GNU Fortran (MacPorts gcc49 4.9.1_0) 4.9.1
>>> 
>>> Will test 1.8 branch now, though I believe the fortran support in 1.8 is
>>> up-to-date
>>> 
>>> 
>>> On Oct 24, 2014, at 6:46 AM, Guillaume Houzeaux 
>>> wrote:
>>> 
>>> Good morning/afternoon/night,
>>> 
>>> I actualized my OS two days ago from Maverick to Yosemite (Xcode6.1).
>>> I recompiled openmpi-1.4.1, openmpi-1.6.3, openmpi-1.8.3 and get the same
>>> problem.
>>> 
>>> What I did:
>>> 
>>> 1. mpif90 test_openmpi.f90 -o test_openmpi.x
>>> 2. ./test_openmpi.x
>>> 
>>> and get
>>> 
>>> BEFORE
>>> [bsccs456.int.bsc.es:81492] [[INVALID],INVALID] ORTE_ERROR_LOG: Unknown
>>> error: -1 in file runtime/orte_globals.c at line 173
>>> input in flex scanner failed
>>> 
>>> 
>>> Thanks in advance,
>>> 
>>> Guillaume
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2014/10/25570.php
>>> 
>>> 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2014/10/25573.php
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2014/10/25574.php
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/10/25576.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] Problem with Yosemite

2014-10-24 Thread Ralph Castain
Could well be - I’m using the libtool from Apple

 Apple Inc. version cctools-855

Just verified that 1.8 is working fine as well.
Ralph


> On Oct 24, 2014, at 9:23 AM, Bert Wesarg  wrote:
> 
> This is maybe related to a problem in libtool:
> 
> http://lists.gnu.org/archive/html/libtool-patches/2014-09/msg2.html
> 
> On Fri, Oct 24, 2014 at 6:09 PM, Ralph Castain  wrote:
>> I was able to build and run the trunk without problem on Yosemite with:
>> 
>> gcc (MacPorts gcc49 4.9.1_0) 4.9.1
>> GNU Fortran (MacPorts gcc49 4.9.1_0) 4.9.1
>> 
>> Will test 1.8 branch now, though I believe the fortran support in 1.8 is
>> up-to-date
>> 
>> 
>> On Oct 24, 2014, at 6:46 AM, Guillaume Houzeaux 
>> wrote:
>> 
>> Good morning/afternoon/night,
>> 
>> I actualized my OS two days ago from Maverick to Yosemite (Xcode6.1).
>> I recompiled openmpi-1.4.1, openmpi-1.6.3, openmpi-1.8.3 and get the same
>> problem.
>> 
>> What I did:
>> 
>> 1. mpif90 test_openmpi.f90 -o test_openmpi.x
>> 2. ./test_openmpi.x
>> 
>> and get
>> 
>> BEFORE
>> [bsccs456.int.bsc.es:81492] [[INVALID],INVALID] ORTE_ERROR_LOG: Unknown
>> error: -1 in file runtime/orte_globals.c at line 173
>> input in flex scanner failed
>> 
>> 
>> Thanks in advance,
>> 
>> Guillaume
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2014/10/25570.php
>> 
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2014/10/25573.php
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/10/25574.php



Re: [OMPI users] which info is needed for SIGSEGV in Java for openmpi-dev-124-g91e9686 on Solaris

2014-10-24 Thread Siegmar Gross
Hi Gilles,

thank you very much for your help.

> how did you configure openmpi ? which java version did you use ?
> 
> i just found a regression and you currently have to explicitly add
> CFLAGS=-D_REENTRANT CPPFLAGS=-D_REENTRANT
> to your configure command line

I added "-D_REENTRANT" to my command.

../openmpi-dev-124-g91e9686/configure --prefix=/usr/local/openmpi-1.9.0_64_gcc \
  --libdir=/usr/local/openmpi-1.9.0_64_gcc/lib64 \
  --with-jdk-bindir=/usr/local/jdk1.8.0/bin \
  --with-jdk-headers=/usr/local/jdk1.8.0/include \
  JAVA_HOME=/usr/local/jdk1.8.0 \
  LDFLAGS="-m64" CC="gcc" CXX="g++" FC="gfortran" \
  CFLAGS="-m64 -D_REENTRANT" CXXFLAGS="-m64" FCFLAGS="-m64" \
  CPP="cpp" CXXCPP="cpp" \
  CPPFLAGS="-D_REENTRANT" CXXCPPFLAGS="" \
  --enable-mpi-cxx \
  --enable-cxx-exceptions \
  --enable-mpi-java \
  --enable-heterogeneous \
  --enable-mpi-thread-multiple \
  --with-threads=posix \
  --with-hwloc=internal \
  --without-verbs \
  --with-wrapper-cflags="-std=c11 -m64" \
  --enable-debug \
  |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_gcc

I use Java 8.

tyr openmpi-1.9 112 java -version
java version "1.8.0"
Java(TM) SE Runtime Environment (build 1.8.0-b132)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b70, mixed mode)
tyr openmpi-1.9 113 

Unfortunately I still get a SIGSEGV with openmpi-dev-124-g91e9686.
I have applied your patch and will try to debug my small Java
program tomorrow or next week and then let you know the result.


Kind regards and thank you very much once more

Siegmar



Re: [OMPI users] Problem with Yosemite

2014-10-24 Thread Bert Wesarg
This is maybe related to a problem in libtool:

http://lists.gnu.org/archive/html/libtool-patches/2014-09/msg2.html

On Fri, Oct 24, 2014 at 6:09 PM, Ralph Castain  wrote:
> I was able to build and run the trunk without problem on Yosemite with:
>
> gcc (MacPorts gcc49 4.9.1_0) 4.9.1
> GNU Fortran (MacPorts gcc49 4.9.1_0) 4.9.1
>
> Will test 1.8 branch now, though I believe the fortran support in 1.8 is
> up-to-date
>
>
> On Oct 24, 2014, at 6:46 AM, Guillaume Houzeaux 
> wrote:
>
> Good morning/afternoon/night,
>
> I actualized my OS two days ago from Maverick to Yosemite (Xcode6.1).
> I recompiled openmpi-1.4.1, openmpi-1.6.3, openmpi-1.8.3 and get the same
> problem.
>
> What I did:
>
> 1. mpif90 test_openmpi.f90 -o test_openmpi.x
> 2. ./test_openmpi.x
>
> and get
>
>  BEFORE
> [bsccs456.int.bsc.es:81492] [[INVALID],INVALID] ORTE_ERROR_LOG: Unknown
> error: -1 in file runtime/orte_globals.c at line 173
> input in flex scanner failed
>
>
> Thanks in advance,
>
> Guillaume
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/10/25570.php
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/10/25573.php


Re: [OMPI users] Problem with Yosemite

2014-10-24 Thread Ralph Castain
I was able to build and run the trunk without problem on Yosemite with:

gcc (MacPorts gcc49 4.9.1_0) 4.9.1
GNU Fortran (MacPorts gcc49 4.9.1_0) 4.9.1

Will test 1.8 branch now, though I believe the fortran support in 1.8 is 
up-to-date


> On Oct 24, 2014, at 6:46 AM, Guillaume Houzeaux  
> wrote:
> 
> Good morning/afternoon/night,
> 
> I actualized my OS two days ago from Maverick to Yosemite (Xcode6.1).
> I recompiled openmpi-1.4.1, openmpi-1.6.3, openmpi-1.8.3 and get the same 
> problem.
> 
> What I did:
> 
> 1. mpif90 test_openmpi.f90 -o test_openmpi.x
> 2. ./test_openmpi.x
> 
> and get
> 
>  BEFORE
> [bsccs456.int.bsc.es:81492] [[INVALID],INVALID] ORTE_ERROR_LOG: Unknown 
> error: -1 in file runtime/orte_globals.c at line 173
> input in flex scanner failed
> 
> 
> Thanks in advance,
> 
> Guillaume
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/10/25570.php



[OMPI users] Bug in OpenMPI-1.8.3: storage limitation in shared memory allocation (MPI_WIN_ALLOCATE_SHARED) in Ftn-code

2014-10-24 Thread Michael.Rachner
Dear developers of OPENMPI,

I am running a small, downsized Fortran test program for shared memory allocation 
(using MPI_WIN_ALLOCATE_SHARED and MPI_WIN_SHARED_QUERY)
on only 1 node of 2 different Linux clusters with OpenMPI-1.8.3 and 
Intel-14.0.4 / Intel-13.0.1, respectively.

The program simply allocates a sequence of shared data windows, each consisting 
of 1 integer*4 array.
None of the windows is freed, so the amount of data allocated in shared 
windows grows during the course of the execution.

That worked well on the 1st cluster (Laki, having 8 procs per node) when 
allocating even 1000 shared windows, each having 50000 integer*4 array elements,
i.e. a total of  200 MBytes.
On the 2nd cluster (Cluster5, having 24 procs per node) it also worked on the 
login node, but it did NOT work on a compute node.
In that error case, something like an internal storage limit of ~140 MB applies 
to the total storage allocated in all shared windows.
When that limit is reached, all later shared memory allocations fail (but 
silently).
So the first attempt to use such a bad shared data window results in a bus 
error due to the bad storage address encountered.

That strange behavior could be observed in the small testprogram but also with 
my large Fortran CFD-code.
If the error occurs, then it occurs with both codes, and both at a storage 
limit of  ~140 MB.
I found that this storage limit depends only weakly on  the number of processes 
(for np=2,4,8,16,24  it is: 144.4 , 144.0, 141.0, 137.0, 132.2 MB)

Note that the shared memory storage available on both clusters was very large 
(many GB of free memory).

Here is the error message when running with np=2 and an array dimension of 
idim_1=50000 for the integer*4 array allocated per shared window
on the compute node of Cluster5:
In that case, the error occurred at the 723rd shared window, which is the 1st 
badly allocated window in that case:
(722 successfully allocated shared windows * 50000 array elements * 4 Bytes/el. 
= 144.4 MB)


[1,0]: on nodemaster: iwin= 722 :
[1,0]:  total storage [MByte] alloc. in shared windows so far:   
144.4000
[1,0]: === allocation of shared window no. iwin= 723
[1,0]:  starting now with idim_1=   50000
[1,0]: on nodemaster for iwin= 723 : before writing on 
shared mem
[1,0]:[r5i5n13:12597] *** Process received signal ***
[1,0]:[r5i5n13:12597] Signal: Bus error (7)
[1,0]:[r5i5n13:12597] Signal code: Non-existant physical address (2)
[1,0]:[r5i5n13:12597] Failing at address: 0x7fffe08da000
[1,0]:[r5i5n13:12597] [ 0] 
[1,0]:/lib64/libpthread.so.0(+0xf800)[0x76d67800]
[1,0]:[r5i5n13:12597] [ 1] ./a.out[0x408a8b]
[1,0]:[r5i5n13:12597] [ 2] ./a.out[0x40800c]
[1,0]:[r5i5n13:12597] [ 3] 
[1,0]:/lib64/libc.so.6(__libc_start_main+0xe6)[0x769fec36]
[1,0]:[r5i5n13:12597] [ 4] [1,0]:./a.out[0x407f09]
[1,0]:[r5i5n13:12597] *** End of error message ***
[1,1]:forrtl: error (78): process killed (SIGTERM)
[1,1]:Image              PC                Routine            Line        Source
[1,1]:libopen-pal.so.6   74B74580  Unknown   
Unknown  Unknown
[1,1]:libmpi.so.177267F3E  Unknown   
Unknown  Unknown
[1,1]:libmpi.so.17733B555  Unknown   
Unknown  Unknown
[1,1]:libmpi.so.17727DFFD  Unknown   
Unknown  Unknown
[1,1]:libmpi_mpifh.so.2  7779BA03  Unknown   
Unknown  Unknown
[1,1]:a.out  00408D15  Unknown   
Unknown  Unknown
[1,1]:a.out  0040800C  Unknown   
Unknown  Unknown
[1,1]:libc.so.6  769FEC36  Unknown   
Unknown  Unknown
[1,1]:a.out  00407F09  Unknown   
Unknown  Unknown
--
mpiexec noticed that process rank 0 with PID 12597 on node r5i5n13 exited on 
signal 7 (Bus error).
--


The small Ftn-testprogram was built and run with:
  mpif90 sharedmemtest.f90
  mpiexec -np 2 -bind-to core -tag-output ./a.out
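
For orientation, here is a minimal sketch of what such a test could look like.
This is not the actual sharedmemtest.f90 from the report (which was not posted);
the program name, the loop bounds, the use of MPI_COMM_WORLD as the (single-node)
communicator, and the availability of the mpi_f08 bindings are assumptions:

  program sharedmem_sketch
    use mpi_f08
    use, intrinsic :: iso_c_binding, only: c_ptr, c_f_pointer
    implicit none
    integer, parameter :: idim_1 = 50000       ! integer*4 elements per window
    integer :: myrank, iwin, disp_unit, ierr
    integer(kind=MPI_ADDRESS_KIND) :: winsize
    type(MPI_Win) :: win
    type(c_ptr)   :: baseptr
    integer(4), pointer :: iarr(:)

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)

    do iwin = 1, 1000
      ! only the node master (rank 0) contributes memory to this window
      winsize = 0
      if (myrank == 0) winsize = int(idim_1, MPI_ADDRESS_KIND) * 4
      call MPI_Win_allocate_shared(winsize, 4, MPI_INFO_NULL, MPI_COMM_WORLD, &
                                   baseptr, win, ierr)
      ! every rank asks for the address of rank 0's segment
      call MPI_Win_shared_query(win, 0, winsize, disp_unit, baseptr, ierr)
      call c_f_pointer(baseptr, iarr, [idim_1])
      ! first write into the new window; this is where the bus error appears
      ! when the backing shared-memory allocation has silently failed
      if (myrank == 0) iarr(:) = iwin
      call MPI_Barrier(MPI_COMM_WORLD, ierr)
      ! the windows are intentionally never freed, so shared storage accumulates
    end do

    call MPI_Finalize(ierr)
  end program sharedmem_sketch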

Why does it work on the Laki (both on the login node and on a compute node) as 
well as on the login node of Cluster5,
but fail on a compute node of Cluster5?

Greetings
   Michael Rachner





Re: [OMPI users] low CPU utilization with OpenMPI

2014-10-24 Thread Gilles Gouaillardet
Can you also check there is no CPU binding issue (several MPI tasks and/or 
OpenMP threads, if any, bound to the same core and doing time sharing)?
A simple way to check that is to log into a compute node, run top and then 
press 1, f, j.
If some cores have higher usage than others, you are likely doing time sharing.
Another option is to disable CPU binding (OMPI and OpenMP, if any) and see if 
things get better, as sketched below.
(This is suboptimal but still better than time sharing.)
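For example, something along these lines (assuming Open MPI 1.8's mpirun options;
"./your_app" is only a placeholder for the actual launch line):

  mpirun --report-bindings -np 8 ./your_app    # show where each rank is bound
  mpirun --bind-to none -np 8 ./your_app       # run with Open MPI binding disabled

and, if OpenMP is involved, setting OMP_NUM_THREADS explicitly helps avoid
oversubscribing cores.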

"Jeff Squyres (jsquyres)"  wrote:
>- Is /tmp on that machine on NFS or local?
>
>- Have you looked at the text of the help message that came out before the "9 
>more processes have sent help message help-opal-shmem-mmap.txt / mmap on nfs" 
>message?  It should contain details about what the problematic NFS directory 
>is.
>
>- Do you know that it's MPI that is causing this low CPU utilization?
>
>- You mentioned other MPI implementations; have you tested with them to see if 
>they get better CPU utilization?
>
>- What happens if you run this application on a single machine, with no 
>network messaging?
>
>- Do you know what specifically in your application is slow?  I.e., have you 
>done any instrumentation to see what steps / API calls are running slowly, and 
>then tried to figure out why?
>
>- Do you have blocking message patterns that might operate well in shared 
>memory, but expose the inefficiencies of its algorithms/design when it moves 
>to higher-latency transports?
>
>- How long does your application run for?
>
>I ask these questions because MPI applications tend to be quite complicated. 
>Sometimes it's the application itself that is the cause of slowdown / 
>inefficiencies.
>
>
>
>On Oct 23, 2014, at 9:29 PM, Vinson Leung  wrote:
>
>> Later I change another machine and set the TMPDIR to default /tmp, but the 
>> problem (low CPU utilization under 20%) still occur :<
>> 
>> Vincent
>> 
>> On Thu, Oct 23, 2014 at 10:38 PM, Jeff Squyres (jsquyres) 
>>  wrote:
>> If normal users can't write to /tmp (or if /tmp is an NFS-mounted 
>> filesystem), that's the underlying problem.
>> 
>> @Vinson -- you should probably try to get that fixed.
>> 
>> 
>> 
>> On Oct 23, 2014, at 10:35 AM, Joshua Ladd  wrote:
>> 
>> > It's not coming from OSHMEM but from the OPAL "shmem" framework. You are 
>> > going to get terrible performance - possibly slowing to a crawl with all 
>> > processes opening their backing files for mmap on NFS. I think that's the 
>> > error that he's getting.
>> >
>> >
>> > Josh
>> >
>> > On Thu, Oct 23, 2014 at 6:06 AM, Vinson Leung  
>> > wrote:
>> > HI, Thanks for your reply:)
>> > I really run an MPI program (compile with OpenMPI and run with "mpirun -n 
>> > 8 .."). My OpenMPI version is 1.8.3 and my program is Gromacs. BTW, 
>> > what is OSHMEM ?
>> >
>> > Best
>> > Vincent
>> >
>> > On Thu, Oct 23, 2014 at 12:21 PM, Ralph Castain  wrote:
>> > From your error message, I gather you are not running an MPI program, but 
>> > rather an OSHMEM one? Otherwise, I find the message strange as it only 
>> > would be emitted from an OSHMEM program.
>> >
>> > What version of OMPI are you trying to use?
>> >
>> >> On Oct 22, 2014, at 7:12 PM, Vinson Leung  wrote:
>> >>
>> >> Thanks for your reply:)
>> >> Follow your advice I tried to set the TMPDIR to /var/tmp and /dev/shm and 
>> >> even reset to /tmp (I get the system permission), the problem still occur 
>> >> (CPU utilization still lower than 20%). I have no idea why and ready to 
>> >> give up OpenMPI instead of using other MPI library.
>> >>
>> >> Old Message-
>> >>
>> >> Date: Tue, 21 Oct 2014 22:21:31 -0400
>> >> From: Brock Palen 
>> >> To: Open MPI Users 
>> >> Subject: Re: [OMPI users] low CPU utilization with OpenMPI
>> >> Message-ID: 
>> >> Content-Type: text/plain; charset=us-ascii
>> >>
>> >> Doing special files on NFS can be weird,  try the other /tmp/ locations:
>> >>
>> >> /var/tmp/
>> >> /dev/shm  (ram disk careful!)
>> >>
>> >> Brock Palen
>> >> www.umich.edu/~brockp
>> >> CAEN Advanced Computing
>> >> XSEDE Campus Champion
>> >> bro...@umich.edu
>> >> (734)936-1985
>> >>
>> >>
>> >>
>> >> > On Oct 21, 2014, at 10:18 PM, Vinson Leung  
>> >> > wrote:
>> >> >
>> >> > Because of permission reason (OpenMPI can not write temporary file to 
>> >> > the default /tmp directory), I change the TMPDIR to my local directory 
>> >> > (export TMPDIR=/home/user/tmp ) and then the MPI program can run. But 
>> >> > the CPU utilization is very low under 20% (8 MPI rank running in Intel 
>> >> > Xeon 8-core CPU).
>> >> >
>> >> > And I also got some message when I run with OpenMPI:
>> >> > [cn3:28072] 9 more processes have sent help message 
>> >> > help-opal-shmem-mmap.txt / mmap on nfs
>> >> > 

[OMPI users] Problem with Yosemite

2014-10-24 Thread Guillaume Houzeaux

Good morning/afternoon/night,

I updated my OS two days ago from Mavericks to Yosemite (Xcode 6.1).
I recompiled openmpi-1.4.1, openmpi-1.6.3, and openmpi-1.8.3, and I get the 
same problem with each of them.


What I did:

1. mpif90 test_openmpi.f90 -o test_openmpi.x
2. ./test_openmpi.x

and get

 BEFORE
[bsccs456.int.bsc.es:81492] [[INVALID],INVALID] ORTE_ERROR_LOG: Unknown 
error: -1 in file runtime/orte_globals.c at line 173

input in flex scanner failed
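
For reference, a test of this kind is typically just an MPI init/finalize check.
The real test_openmpi.f90 was attached (test_openmpi.f90.bz2) and is not
reproduced here; the sketch below is only an assumed shape, consistent with the
"BEFORE" line printed above:

  program test_openmpi
    use mpi
    implicit none
    integer :: ierr, rank

    write(*,*) 'BEFORE'                  ! last output seen before the failure
    call MPI_Init(ierr)                  ! the ORTE error above occurs during init
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    write(*,*) 'AFTER, rank =', rank
    call MPI_Finalize(ierr)
  end program test_openmpi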


Thanks in advance,

Guillaume

--
Le camembert, de son fumet de venaison, avait vaincu les odeurs plus 
sourdes du marolles
et du limbourg; il élargissait ses exhalaisons, étouffait les autres 
senteurs sous une
abondance surprenante d'haleines gâtées. Cependant, au milieu de cette 
phrase vigoureuse,
le parmesan jetait par moments un filet mince de flûte champêtre ; 
tandis que les brie y
mettaient des douceurs fades de tambourins humides. Il y eut une reprise 
suffocante du
livarot. Et cette symphonie se tint un moment sur une note aiguë du 
géromé anisé,

prolongée en point d'orgue.

Emile Zola - Le Ventre de Paris


Guillaume Houzeaux
Team Leader
Dpt. Computer Applications in Science and Engineering
Barcelona Supercomputing Center (BSC-CNS)
Edificio NEXUS I, Office 204
c) Gran Capitan 2-4
08034 Barcelona, Spain

Tel: +34 93 405 4291
Fax: +34 93 413 7721
Skype user: guillaume_houzeaux_bsc
WWW: CASE department 
ResearcherID: D-4950-2012
Scientific Profile: 
 
 
 





WARNING / LEGAL TEXT: This message is intended only for the use of the
individual or entity to which it is addressed and may contain
information which is privileged, confidential, proprietary, or exempt
from disclosure under applicable law. If you are not the intended
recipient or the person responsible for delivering the message to the
intended recipient, you are strictly prohibited from disclosing,
distributing, copying, or in any way using this message. If you have
received this communication in error, please notify the sender and
destroy and delete any copies you may have received.

http://www.bsc.es/disclaimer

config.log.bz2
Description: BZip2 compressed data


environment_variables.bz2
Description: BZip2 compressed data


ompi_info.bz2
Description: BZip2 compressed data


ompi_infovompi.bz2
Description: BZip2 compressed data


test_openmpi.f90.bz2
Description: BZip2 compressed data

Re: [OMPI users] New ib locked pages behavior?

2014-10-24 Thread Jeff Squyres (jsquyres)
On Oct 22, 2014, at 3:37 AM, r...@q-leap.de wrote:

> I've commented in detail on this (non-)issue on 2014-08-20:
> 
> http://www.open-mpi.org/community/lists/users/2014/08/25090.php
> 
> A change in the FAQ and a fix in the code would really be nice
> at this stage.

Thanks for the reminder; I've pinged some folks to update the FAQ.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] low CPU utilization with OpenMPI

2014-10-24 Thread Jeff Squyres (jsquyres)
- Is /tmp on that machine on NFS or local?

- Have you looked at the text of the help message that came out before the "9 
more processes have sent help message help-opal-shmem-mmap.txt / mmap on nfs" 
message?  It should contain details about what the problematic NFS directory is.

- Do you know that it's MPI that is causing this low CPU utilization?

- You mentioned other MPI implementations; have you tested with them to see if 
they get better CPU utilization?

- What happens if you run this application on a single machine, with no network 
messaging?

- Do you know what specifically in your application is slow?  I.e., have you 
done any instrumentation to see what steps / API calls are running slowly, and 
then tried to figure out why?

- Do you have blocking message patterns that might operate well in shared 
memory, but expose the inefficiencies of its algorithms/design when it moves to 
higher-latency transports?

- How long does your application run for?

I ask these questions because MPI applications tend to be quite complicated. 
Sometimes it's the application itself that is the cause of slowdown / 
inefficiencies.



On Oct 23, 2014, at 9:29 PM, Vinson Leung  wrote:

> Later I change another machine and set the TMPDIR to default /tmp, but the 
> problem (low CPU utilization under 20%) still occur :<
> 
> Vincent
> 
> On Thu, Oct 23, 2014 at 10:38 PM, Jeff Squyres (jsquyres) 
>  wrote:
> If normal users can't write to /tmp (or if /tmp is an NFS-mounted 
> filesystem), that's the underlying problem.
> 
> @Vinson -- you should probably try to get that fixed.
> 
> 
> 
> On Oct 23, 2014, at 10:35 AM, Joshua Ladd  wrote:
> 
> > It's not coming from OSHMEM but from the OPAL "shmem" framework. You are 
> > going to get terrible performance - possibly slowing to a crawl with all 
> > processes opening their backing files for mmap on NFS. I think that's the 
> > error that he's getting.
> >
> >
> > Josh
> >
> > On Thu, Oct 23, 2014 at 6:06 AM, Vinson Leung  
> > wrote:
> > HI, Thanks for your reply:)
> > I really run an MPI program (compile with OpenMPI and run with "mpirun -n 8 
> > .."). My OpenMPI version is 1.8.3 and my program is Gromacs. BTW, what 
> > is OSHMEM ?
> >
> > Best
> > Vincent
> >
> > On Thu, Oct 23, 2014 at 12:21 PM, Ralph Castain  wrote:
> > From your error message, I gather you are not running an MPI program, but 
> > rather an OSHMEM one? Otherwise, I find the message strange as it only 
> > would be emitted from an OSHMEM program.
> >
> > What version of OMPI are you trying to use?
> >
> >> On Oct 22, 2014, at 7:12 PM, Vinson Leung  wrote:
> >>
> >> Thanks for your reply:)
> >> Follow your advice I tried to set the TMPDIR to /var/tmp and /dev/shm and 
> >> even reset to /tmp (I get the system permission), the problem still occur 
> >> (CPU utilization still lower than 20%). I have no idea why and ready to 
> >> give up OpenMPI instead of using other MPI library.
> >>
> >> Old Message-
> >>
> >> Date: Tue, 21 Oct 2014 22:21:31 -0400
> >> From: Brock Palen 
> >> To: Open MPI Users 
> >> Subject: Re: [OMPI users] low CPU utilization with OpenMPI
> >> Message-ID: 
> >> Content-Type: text/plain; charset=us-ascii
> >>
> >> Doing special files on NFS can be weird,  try the other /tmp/ locations:
> >>
> >> /var/tmp/
> >> /dev/shm  (ram disk careful!)
> >>
> >> Brock Palen
> >> www.umich.edu/~brockp
> >> CAEN Advanced Computing
> >> XSEDE Campus Champion
> >> bro...@umich.edu
> >> (734)936-1985
> >>
> >>
> >>
> >> > On Oct 21, 2014, at 10:18 PM, Vinson Leung  
> >> > wrote:
> >> >
> >> > Because of permission reason (OpenMPI can not write temporary file to 
> >> > the default /tmp directory), I change the TMPDIR to my local directory 
> >> > (export TMPDIR=/home/user/tmp ) and then the MPI program can run. But 
> >> > the CPU utilization is very low under 20% (8 MPI rank running in Intel 
> >> > Xeon 8-core CPU).
> >> >
> >> > And I also got some message when I run with OpenMPI:
> >> > [cn3:28072] 9 more processes have sent help message 
> >> > help-opal-shmem-mmap.txt / mmap on nfs
> >> > [cn3:28072] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
> >> > help / error messages
> >> >
> >> > Any idea?
> >> > Thanks
> >> >
> >> > VIncent


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] which info is needed for SIGSEGV in Java for openmpi-dev-124-g91e9686 on Solaris

2014-10-24 Thread Gilles Gouaillardet
Siegmar,

How did you configure openmpi? Which java version did you use?

I just found a regression, and you currently have to explicitly add
CFLAGS=-D_REENTRANT CPPFLAGS=-D_REENTRANT
to your configure command line.

If you want to debug this issue (I cannot reproduce it on a Solaris 11
x86 virtual machine),
you can apply the attached patch, make sure you configure with
--enable-debug, and run

OMPI_ATTACH=1 mpiexec -n 1 java InitFinalizeMain

Then you will need to attach the *java* process with gdb, set the _dbg
local variable to zero, and continue; see the sketch below.
You should get a clean stack trace, and hopefully we will be able to help.
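
Something like the following gdb session (a sketch only - the PID is of course
system-specific, and the patch is assumed to make the java process spin on the
local variable _dbg until it is cleared):

  gdb -p <pid-of-the-java-process>
  (gdb) thread apply all backtrace      # locate the frame that is spinning on _dbg
  (gdb) thread <N>                      # select that thread ...
  (gdb) frame <M>                       # ... and the frame holding _dbg
  (gdb) set var _dbg = 0
  (gdb) continue
  ... wait for the SIGSEGV, then take a fresh backtrace ...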

Cheers,

Gilles

On 2014/10/24 0:03, Siegmar Gross wrote:
> Hello Oscar,
>
> do you have time to look into my problem? Probably Takahiro has a
> point and gdb behaves differently on Solaris and Linux, so that
> the differing outputs have no meaning. I tried to debug my Java
> program, but without success so far, because I wasn't able to get
> into the Java program to set a breakpoint or to see the code. Have
> you succeeded to debug a mpiJava program? If so, how must I call
> gdb (I normally use "gdb mipexec" and then "run -np 1 java ...")?
> What can I do to get helpful information to track the error down?
> I have attached the error log file. Perhaps you can see if something
> is going wrong with the Java interface. Thank you very much for your
> help and any hints for the usage of gdb with mpiJava in advance.
> Please let me know if I can provide anything else.
>
>
> Kind regards
>
> Siegmar
>
>
>>> I think that it must have to do with MPI, because everything
>>> works fine on Linux and my Java program works fine with an older
>>> MPI version (openmpi-1.8.2a1r31804) as well.
>> Yes. I also think it must have to do with MPI.
>> But on the java process side, not the mpiexec process side.
>>
>> When you run a Java MPI program via mpiexec, the mpiexec process
>> launches a java process. When the java process (your
>> Java program) calls an MPI method, the native part (written in C/C++)
>> of the MPI library is called. It runs in the java process, not in
>> the mpiexec process. I suspect that part.
>>
>>> On Solaris things are different.
>> Are you saying the following difference?
>> After this line,
>>> 881 ORTE_ACTIVATE_JOB_STATE(jdata, ORTE_JOB_STATE_INIT);
>> Linux shows
>>> orte_job_state_to_str (state=1)
>>> at ../../openmpi-dev-124-g91e9686/orte/util/error_strings.c:217
>>> 217 switch(state) {
>> but Solaris shows
>>> orte_util_print_name_args (name=0x100118380 )
>>> at ../../openmpi-dev-124-g91e9686/orte/util/name_fns.c:122
>>> 122 if (NULL == name) {
>> Each macro is defined as:
>>
>> #define ORTE_ACTIVATE_JOB_STATE(j, s)                                      \
>>     do {                                                                   \
>>         orte_job_t *shadow=(j);                                            \
>>         opal_output_verbose(1, orte_state_base_framework.framework_output, \
>>                             "%s ACTIVATE JOB %s STATE %s AT %s:%d",        \
>>                             ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),            \
>>                             (NULL == shadow) ? "NULL" :                    \
>>                             ORTE_JOBID_PRINT(shadow->jobid),               \
>>                             orte_job_state_to_str((s)),                    \
>>                             __FILE__, __LINE__);                           \
>>         orte_state.activate_job_state(shadow, (s));                        \
>>     } while(0);
>>
>> #define ORTE_NAME_PRINT(n) \
>> orte_util_print_name_args(n)
>>
>> #define ORTE_JOBID_PRINT(n) \
>> orte_util_print_jobids(n)
>>
>> I'm not sure, but I think the gdb on Solaris steps into
>> orte_util_print_name_args, but gdb on Linux doesn't step into
>> orte_util_print_name_args and orte_util_print_jobids for some
>> reason, or orte_job_state_to_str is evaluated before them.
>>
>> So I think it's not an important difference.
>>
>> You showed the following lines.
> orterun (argc=5, argv=0x7fffe0d8)
> at 
> ../../../../openmpi-dev-124-g91e9686/orte/tools/orterun/orterun.c:1084
> 1084while (orte_event_base_active) {
> (gdb) 
> 1085opal_event_loop(orte_event_base, OPAL_EVLOOP_ONCE);
> (gdb) 
>> I'm not familiar with this code, but I think this part (in the mpiexec
>> process) is only waiting for the java process to terminate (normally
>> or abnormally). So I think the problem is not in the mpiexec process
>> but in the java process.
>>
>> Regards,
>> Takahiro
>>
>>> Hi Takahiro,
>>>
 mpiexec and java run as distinct processes. Your JRE message
 says java process raises SEGV. So you should trace the java
 process, not the mpiexec process. And more, your JRE message
 says the crash happened outside the Java Virtual Machine in
 native code. So usual Java program debugger is useless.
 You should trace native code part of the java process.