Re: [OMPI users] Where is the error? (MPI program in fortran)

2014-04-17 Thread Gus Correa

Hi Oscar

As Ralph suggested, the problem is indeed a memory access violation,
a typical violation of array bounds.
Not really an MPI or Open MPI problem to be addressed
by this mailing list.

Your ran2 function has a memory violation bug.
It declares the array with dimension ir(1000),
but the algorithm generates indices j above 1000 for that array.
Here is a sample:

[1,1]: j =   72
[1,1]: j =  686
[1,1]: j = 1353
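
For illustration, here is a minimal, self-contained sketch of the failure
pattern (the index expression and the little demo program are hypothetical;
only the ir(1000) declaration and the printed j values come from your code):

      program bounds_demo
        implicit none
        integer :: ir(1000), j         ! fixed-size work array, as in ran2
        real    :: r
        r = 1.0                        ! worst case: r reaches 1.0
        j = 1 + int(r*1000.0)          ! j becomes 1001
        ir(j) = 1                      ! writes one element past the end of ir
        print *, 'j =', j
      end program bounds_demo

An out-of-bounds write like this silently corrupts neighboring memory,
which is why the crash often only shows up later, for example inside
free() at a deallocate. Compiling with run-time bounds checking
(gfortran: -fcheck=bounds, or -fbounds-check on older versions;
ifort: -check bounds) makes the run abort at the offending line instead.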

By the way, despite what a comment in the program says,
your ran2 shares only its name with the ran2
algorithm in Numerical Recipes.
If you want to use a random number generator from Num. Rec.,
the books and the algorithms are available online
(ran2 is on p. 272, ch. 7, Numerical Recipes in Fortran 77 or 90):

http://www.nr.com/oldverswitcher.html
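
If you don't specifically need the Numerical Recipes generator, the
standard Fortran intrinsics random_seed and random_number are a simple
alternative that avoids hand-managed work arrays. A minimal sketch,
assuming each rank should draw a different sequence (the base seed value
and the program layout are just an illustration):

      program rng_demo
        use mpi
        implicit none
        integer :: ierr, myrank, n
        integer, allocatable :: seed(:)
        real :: u
        call MPI_Init(ierr)
        call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
        call random_seed(size=n)           ! how many seed values are needed
        allocate(seed(n))
        seed = 12345 + myrank              ! offset the seed by the rank
        call random_seed(put=seed)
        call random_number(u)              ! u is uniform in [0,1)
        print *, 'rank', myrank, 'u =', u
        call MPI_Finalize(ierr)
      end program rng_demo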

As a general suggestion, you may get fewer bugs in Fortran if you drop
all implicit variable declarations in the program code
and replace them by explicit declarations
(and add "implicit none" to all program units, to play safe).
Implicit variable declarations are a big source of bugs.
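
A minimal sketch of what that looks like (the subroutine name is taken
from your backtrace; the declarations themselves are only examples):

      subroutine inv_grav3d_vfsa()
        implicit none                ! every variable must now be declared
        integer :: i, j
        real, allocatable :: zv(:), xrec(:), yrec(:), xprm(:), yprm(:)
        ! ... body of the subroutine ...
      end subroutine inv_grav3d_vfsa

With implicit none, a mistyped variable name or a forgotten declaration
becomes a compile-time error instead of a silently created new variable.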

I hope this helps,
Gus Correa

PS - If you are at UFBA, send my hello to Milton Porsani, please.

On 04/17/2014 02:01 PM, Oscar Mojica wrote:

Hello guys

I used the command

ulimit -s unlimited

and got

stack size  (kbytes, -s) unlimited

but when I ran the program I got the same error. So I used the gdb
debugger. I compiled using

mpif90 -g -o mpivfsa_versao2.f  exe

I ran the program and then I ran gdb with both the executable and the
core file name as arguments and got the following

Program received signal SIGSEGV, Segmentation fault.
0x2b59b54c in free () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) backtrace
#0  0x2b59b54c in free () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00406801 in inv_grav3d_vfsa () at mpivfsa_versao2.f:131
#2  0x00406b88 in main (argc=1, argv=0x7fffe387) at
mpivfsa_versao2.f:9
#3  0x2b53976d in __libc_start_main () from
/lib/x86_64-linux-gnu/libc.so.6
#4  0x00401399 in _start ()

These are the lines

9     use mpi
131   deallocate(zv,xrec,yrec,xprm,yprm)

I think the problem is not memory; the problem is related to MPI

Which could be the error?
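
In outline, the usual sequence for this kind of core-file debugging looks
like the following (exe is a placeholder name; note that -o takes the
output name, with the source file listed last):

      mpif90 -g -o exe mpivfsa_versao2.f     # compile with debug symbols
      ulimit -c unlimited                    # allow a core file to be written
      mpirun -np 4 ./exe                     # run until the segfault
      gdb ./exe core                         # then type: backtrace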

_Oscar Fabian Mojica Ladino_
Geologist M.S. in  Geophysics


 > From: o_moji...@hotmail.com
 > Date: Wed, 16 Apr 2014 15:17:51 -0300
 > To: us...@open-mpi.org
 > Subject: Re: [OMPI users] Where is the error? (MPI program in fortran)
 >
 > Gus
 > It is a single machine and I have installed Ubuntu 12.04 LTS. I left
my computer in the college but I will try to follow your advice when I
can and tell you about it.
 >
 > Thanks
 >
 > Sent from my iPad
 >
 > > On 16/04/2014, at 14:17, "Gus Correa" wrote:
 > >
 > > Hi Oscar
 > >
 > > This is a long shot, but maybe worth trying.
 > > I am assuming you're using Linux, or some form of Unix, right?
 > >
 > > You may try to increase the stack size.
 > > The default in Linux is often too small for large programs.
 > > Sometimes this may cause a segmentation fault, even if the
 > > program is correct.
 > >
 > > You can check what you have with:
 > >
 > > ulimit -a (bash)
 > >
 > > or
 > >
 > > limit (csh or tcsh)
 > >
 > > Then set it to a larger number or perhaps to unlimited,
 > > e.g.:
 > >
 > > ulimit -s unlimited
 > >
 > > or
 > >
 > > limit stacksize unlimited
 > >
 > > You didn't say anything about the computer(s) you are using.
 > > Is this a single machine, a cluster, something else?
 > >
 > > Anyway, resetting the stack size may depend a bit on what you
 > > have in /etc/security/limits.conf,
 > > and whether it allows you to increase the stack size.
 > > If it is a single computer to which you have root access, you may
 > > do it yourself.
 > > There are other limits worth increasing (number of open files,
 > > max locked memory).
 > > For instance, this could go in limits.conf:
 > >
 > > * - memlock -1
 > > * - stack -1
 > > * - nofile 4096
 > >
 > > See 'man limits.conf' for details.
 > >
 > > If it is a cluster, this should be set on all nodes,
 > > and you may need to ask your system administrator to do it.
 > >
 > > I hope this helps,
 > > Gus Correa
 > >
 > >> On 04/16/2014 11:24 AM, Gus Correa wrote:
 > >>> On 04/16/2014 08:30 AM, Oscar Mojica wrote:
 > >>> What would the command line be to compile with the -g option? What
 > >>> debugger can I use?
 > >>> Thanks
 > >>>
 > >>
 > >> Replace any optimization flags (-O2, or similar) by -g.
 > >> Check if your compiler has the -traceback flag or similar
 > >> (man compiler-name).
 > >>
 > >> The gdb debugger is normally available on Linux (or you can install it
 > >> with yum, apt-get, etc). An alternative is ddd, with a GUI (can
also be
 > >> installed from yum, etc).
 > >> If you use a commercial compiler you may have a debugger with a GUI.
 > >>
 > >>> Sent from my iPad
 > >>>
 >  On 15/04/2014, at 18:20, "Gus Correa" wrote:

Re: [OMPI users] Where is the error? (MPI program in fortran)

2014-04-17 Thread Jeff Squyres (jsquyres)
Sounds like you're freeing memory that does not belong to you.  Or you have 
some kind of memory corruption somehow.


On Apr 17, 2014, at 2:01 PM, Oscar Mojica  wrote:

> Hello guys
> 
> I used the command 
> 
> ulimit -s unlimited
> 
> and got 
> 
> stack size  (kbytes, -s) unlimited
> 
> but when I ran the program I got the same error. So I used the gdb debugger. I 
> compiled using 
> 
> mpif90 -g -o mpivfsa_versao2.f  exe
> 
> I ran the program and then I ran gdb with both the executable and the core 
> file name as arguments and got the following
> 
> Program received signal SIGSEGV, Segmentation fault.
> 0x2b59b54c in free () from /lib/x86_64-linux-gnu/libc.so.6
> (gdb) backtrace
> #0  0x2b59b54c in free () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x00406801 in inv_grav3d_vfsa () at mpivfsa_versao2.f:131
> #2  0x00406b88 in main (argc=1, argv=0x7fffe387) at 
> mpivfsa_versao2.f:9
> #3  0x2b53976d in __libc_start_main () from 
> /lib/x86_64-linux-gnu/libc.so.6
> #4  0x00401399 in _start ()
> 
> These are the lines
> 
> 9     use mpi
> 131   deallocate(zv,xrec,yrec,xprm,yprm)
> 
> I think the problem is not memory; the problem is related to MPI
> 
> Which could be the error?
> 
> Oscar Fabian Mojica Ladino
> Geologist M.S. in  Geophysics
> 
> 
> > From: o_moji...@hotmail.com
> > Date: Wed, 16 Apr 2014 15:17:51 -0300
> > To: us...@open-mpi.org
> > Subject: Re: [OMPI users] Where is the error? (MPI program in fortran)
> > 
> > Gus
> > It is a single machine and I have installed Ubuntu 12.04 LTS. I left my 
> > computer in the college but I will try to follow your advice when I can and 
> > tell you about it.
> > 
> > Thanks 
> > 
> > Sent from my iPad
> > 
> > > On 16/04/2014, at 14:17, "Gus Correa" wrote:
> > > 
> > > Hi Oscar
> > > 
> > > This is a long shot, but maybe worth trying.
> > > I am assuming you're using Linux, or some form of Unix, right?
> > > 
> > > You may try to increase the stack size.
> > > The default in Linux is often too small for large programs.
> > > Sometimes this may cause a segmentation fault, even if the
> > > program is correct.
> > > 
> > > You can check what you have with:
> > > 
> > > ulimit -a (bash)
> > > 
> > > or
> > > 
> > > limit (csh or tcsh)
> > > 
> > > Then set it to a larger number or perhaps to unlimited,
> > > e.g.:
> > > 
> > > ulimit -s unlimited
> > > 
> > > or
> > > 
> > > limit stacksize unlimited
> > > 
> > > You didn't say anything about the computer(s) you are using.
> > > Is this a single machine, a cluster, something else?
> > > 
> > > Anyway, resetting the stack size may depend a bit on what you
> > > have in /etc/security/limits.conf,
> > > and whether it allows you to increase the stack size.
> > > If it is a single computer to which you have root access, you may
> > > do it yourself.
> > > There are other limits worth increasing (number of open files,
> > > max locked memory).
> > > For instance, this could go in limits.conf:
> > > 
> > > * - memlock -1
> > > * - stack -1
> > > * - nofile 4096
> > > 
> > > See 'man limits.conf' for details.
> > > 
> > > If it is a cluster, this should be set on all nodes,
> > > and you may need to ask your system administrator to do it.
> > > 
> > > I hope this helps,
> > > Gus Correa
> > > 
> > >> On 04/16/2014 11:24 AM, Gus Correa wrote:
> > >>> On 04/16/2014 08:30 AM, Oscar Mojica wrote:
> > >>> What would the command line be to compile with the -g option? What
> > >>> debugger can I use?
> > >>> Thanks
> > >>> 
> > >> 
> > >> Replace any optimization flags (-O2, or similar) by -g.
> > >> Check if your compiler has the -traceback flag or similar
> > >> (man compiler-name).
> > >> 
> > >> The gdb debugger is normally available on Linux (or you can install it
> > >> with yum, apt-get, etc). An alternative is ddd, with a GUI (can also be
> > >> installed from yum, etc).
> > >> If you use a commercial compiler you may have a debugger with a GUI.
> > >> 
> > >>> Sent from my iPad
> > >>> 
> >  On 15/04/2014, at 18:20, "Gus Correa" wrote:
> >  
> >  Or just compiling with -g or -traceback (depending on the compiler) 
> >  will
> >  give you more information about the point of failure
> >  in the error message.
> >  
> > > On 04/15/2014 04:25 PM, Ralph Castain wrote:
> > > Have you tried using a debugger to look at the resulting core file? It
> > > will probably point you right at the problem. Most likely a case of
> > > overrunning some array when #temps > 5
> > > 
> > > 
> > > 
> > > 
> > > On Tue, Apr 15, 2014 at 10:46 AM, Oscar Mojica wrote:
> > > 
> > > Hello everybody
> > > 
> > > I implemented a parallel simulated annealing algorithm in Fortran.
> > > The algorithm is described 

Re: [OMPI users] Where is the error? (MPI program in fortran)

2014-04-17 Thread Oscar Mojica
Hello guys
I used the command 
ulimit -s unlimited
and got 
stack size  (kbytes, -s) unlimited
but when I ran the program I got the same error. So I used the gdb debugger. I 
compiled using 
mpif90 -g -o mpivfsa_versao2.f  exe
I ran the program and then I ran gdb with both the executable and the core file 
name as arguments and got the following
Program received signal SIGSEGV, Segmentation fault.
0x2b59b54c in free () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) backtrace
#0  0x2b59b54c in free () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00406801 in inv_grav3d_vfsa () at mpivfsa_versao2.f:131
#2  0x00406b88 in main (argc=1, argv=0x7fffe387) at mpivfsa_versao2.f:9
#3  0x2b53976d in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x00401399 in _start ()
These are the lines
9     use mpi
131   deallocate(zv,xrec,yrec,xprm,yprm)
I think the problem is not memory; the problem is related to MPI
Which could be the error?
Oscar Fabian Mojica Ladino
Geologist M.S. in  Geophysics


> From: o_moji...@hotmail.com
> Date: Wed, 16 Apr 2014 15:17:51 -0300
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Where is the error? (MPI program in fortran)
> 
> Gus
> It is a single machine and I have installed Ubuntu 12.04 LTS. I left my 
> computer in the college but  I will try to follow your advice when I can and 
> tell you about it.
> 
> Thanks 
> 
> Sent from my iPad
> 
> > On 16/04/2014, at 14:17, "Gus Correa" wrote:
> > 
> > Hi Oscar
> > 
> > This is a long shot, but maybe worth trying.
> > I am assuming you're using Linux, or some form of Unix, right?
> > 
> > You may try to increase the stack size.
> > The default in Linux is often too small for large programs.
> > Sometimes this may cause a segmentation fault, even if the
> > program is correct.
> > 
> > You can check what you have with:
> > 
> > ulimit -a (bash)
> > 
> > or
> > 
> > limit (csh or tcsh)
> > 
> > Then set it to a larger number or perhaps to unlimited,
> > e.g.:
> > 
> > ulimit -s unlimited
> > 
> > or
> > 
> > limit stacksize unlimited
> > 
> > You didn't say anything about the computer(s) you are using.
> > Is this a single machine, a cluster, something else?
> > 
> > Anyway, resetting the stack size may depend a bit on what you
> > have in /etc/security/limits.conf,
> > and whether it allows you to increase the stack size.
> > If it is a single computer to which you have root access, you may
> > do it yourself.
> > There are other limits worth increasing (number of open files,
> > max locked memory).
> > For instance, this could go in limits.conf:
> > 
> > *   -   memlock -1
> > *   -   stack   -1
> > *   -   nofile  4096
> > 
> > See 'man limits.conf' for details.
> > 
> > If it is a cluster, this should be set on all nodes,
> > and you may need to ask your system administrator to do it.
> > 
> > I hope this helps,
> > Gus Correa
> > 
> >> On 04/16/2014 11:24 AM, Gus Correa wrote:
> >>> On 04/16/2014 08:30 AM, Oscar Mojica wrote:
> >>> What would the command line be to compile with the -g option? What
> >>> debugger can I use?
> >>> Thanks
> >>> 
> >> 
> >> Replace any optimization flags (-O2, or similar) by -g.
> >> Check if your compiler has the -traceback flag or similar
> >> (man compiler-name).
> >> 
> >> The gdb debugger is normally available on Linux (or you can install it
> >> with yum, apt-get, etc).  An alternative is ddd, with a GUI (can also be
> >> installed from yum, etc).
> >> If you use a commercial compiler you may have a debugger with a GUI.
> >> 
> >>> Sent from my iPad
> >>> 
>  On 15/04/2014, at 18:20, "Gus Correa" wrote:
>  
>  Or just compiling with -g or -traceback (depending on the compiler) will
>  give you more information about the point of failure
>  in the error message.
>  
> > On 04/15/2014 04:25 PM, Ralph Castain wrote:
> > Have you tried using a debugger to look at the resulting core file? It
> > will probably point you right at the problem. Most likely a case of
> > overrunning some array when #temps > 5
> > 
> > 
> > 
> > 
> > On Tue, Apr 15, 2014 at 10:46 AM, Oscar Mojica wrote:
> > 
> >Hello everybody
> > 
> >    I implemented a parallel simulated annealing algorithm in Fortran.
> >    The algorithm is described as follows
> > 
> >1. The MPI program initially generates P processes that have rank
> >0,1,...,P-1.
> >2. The MPI program generates a starting point and sends it  for all
> >processes set T=T0
> >3. At the current temperature T, each process begins to execute
> >iterative operations
> >4. At end of iterations, process with rank 0 is responsible for
> >  collecting the solution obtained by
> >  

Re: [OMPI users] Conflicts between jobs running on the same node

2014-04-17 Thread Ralph Castain
Unfortunately, each execution of mpirun has no knowledge of where the procs
have been placed and bound by another execution of mpirun. So what is
happening is that the procs of the two jobs are being bound to the same
cores, thus causing contention.

If you truly want to run two jobs at the same time on the same nodes, then
you should add "--bind-to none" on the cmd line. Each job will see a
performance impact relative to running bound on their own, but the jobs
will run much better if they are sharing nodes.
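
For example, assuming each job is launched with 4 processes as in the
report above (the executable names are placeholders):

   mpirun --bind-to none -n 4 ./jobA.exe
   mpirun --bind-to none -n 4 ./jobB.exe

With binding disabled, the OS scheduler is free to spread the 8 processes
across the node's cores instead of stacking both jobs on the same 4 cores.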

Ralph



On Thu, Apr 17, 2014 at 10:37 AM, Alfonso Sanchez <
alfonso.sanc...@tyndall.ie> wrote:

> Hi all,
>
> I've compiled OMPI 1.8 on an x64 Linux cluster using the PGI compilers
> v14.1 (I've tried it with PGI v11.10 and get the same result). I'm able to
> compile with the resulting mpicc/mpifort/etc. When running the codes,
> everything seems to be working fine when there's only one job running on a
> given computing node. However, whenever a second job gets assigned the same
> computing node, the CPU load of every process gets divided by 2. I'm using
> pbs torque. As an example:
>
> -Submit jobA using torque to node1 using mpirun -n 4
>
> -All 4 processes of jobA show 100% CPU load.
>
> -Submit jobB using torque to node1 using mpirun -n 4
>
> -All 8 processes ( 4 from jobA & 4 from jobB ) show 50% CPU load.
>
> Moreover, whilst jobA or jobB would run in 30 mins by itself, when both jobs
> are on the same node they've gone 14 hrs without completing.
>
> I'm attaching config.log & the output of ompi_info --all (bzipped)
>
> Some more info:
>
> $> ompi_info | grep tm
>
> MCA ess: tm (MCA v2.0, API v3.0, Component v1.8)
> MCA plm: tm (MCA v2.0, API v2.0, Component v1.8)
> MCA ras: tm (MCA v2.0, API v2.0, Component v1.8)
>
> Sorry if this is a common problem but I've tried searching for posts
> discussing similar problems but haven't been able to find any.
>
> Thanks for your help,
> Alfonso


[OMPI users] Conflicts between jobs running on the same node

2014-04-17 Thread Alfonso Sanchez
Hi all,

I've compiled OMPI 1.8 on an x64 Linux cluster using the PGI compilers v14.1 
(I've tried it with PGI v11.10 and get the same result). I'm able to compile 
with the resulting mpicc/mpifort/etc. When running the codes, everything seems 
to be working fine when there's only one job running on a given computing node. 
However, whenever a second job gets assigned the same computing node, the CPU 
load of every process gets divided by 2. I'm using pbs torque. As an example:

-Submit jobA using torque to node1 using mpirun -n 4

-All 4 processes of jobA show 100% CPU load.

-Submit jobB using torque to node1 using mpirun -n 4

-All 8 processes ( 4 from jobA & 4 from jobB ) show 50% CPU load.

Moreover, whilst jobA or jobB would run in 30 mins by itself, when both jobs are 
on the same node they've gone 14 hrs without completing.

I'm attaching config.log & the output of ompi_info --all (bzipped)

Some more info:

$> ompi_info | grep tm

MCA ess: tm (MCA v2.0, API v3.0, Component v1.8)
MCA plm: tm (MCA v2.0, API v2.0, Component v1.8)
MCA ras: tm (MCA v2.0, API v2.0, Component v1.8)

Sorry if this is a common problem but I've tried searching for posts discussing 
similar problems but haven't been able to find any.

Thanks for your help,
Alfonso

[Attachment: config.log.bz2]
[Attachment: ompi_output.log.bz2]